MANAGEMENT AND ANALYSIS OF
DOCUMENT INFORMATION TEXT



CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority from the following U.S.
Provisional Application:

U.S. Provisional Patent Application, serial no. 60/028,437,
David L. Snyder and Randall J. Calistri-Yeh, entitled "Management and
Analysis of Patent Information Text (MAPIT)", filed October 15, 1996.

CROSS-REFERENCE TO ARTICLES

The following publications are directed to techniques for 
measuring document similarity including information directed to subject 
field coders, semantic thread analysis and/or TF.IDF techniques: 

Liddy, E.D., Paik, W., Yu, E.S. & McVearry, K., "An overview
of DR-LINK and its approach to document filtering," Proceedings of the
ARPA Workshop on Human Language Technology (1993).

Liddy, E.D. & Myaeng, S.H. (1994). DR-LINK System: Phase I
Summary. Proceedings of the TIPSTER Phase I Final Report.

Liddy, E.D., Paik, W., Yu, E.S. & McKenna, M. (1994). Document
retrieval using linguistic knowledge. Proceedings of RIAO '94 Conference.

Liddy, E.D., Paik, W., Yu, E.S. Text categorization for
multiple users based on semantic information from an MRD. ACM
Transactions on Information Systems. Publication date: 1994.
Presentation date: July, 1994.

Liddy, E.D., Paik, W., McKenna, M. & Yu, E.S. (1995).
A natural language text retrieval system with relevance feedback.
Proceedings of the 16th National Online Meeting.

Paik, W., Liddy, E.D., Yu, E.S. & McKenna, M. Categorizing
and standardizing proper nouns for efficient information retrieval.
Proceedings of the ACL Workshop on Acquisition of Lexical Knowledge from
Text. Publication date: 1993.

Paik, W., Liddy, E.D., Yu, E.S. & McKenna, M.
Interpretation of Proper Nouns for Information Retrieval. Proceedings of
the ARPA Workshop on Human Language Technology. Publication date: 1993.

Salton, G. and Buckley, C. Term-weighting Approaches in
Automatic Text Retrieval. Information Processing and Management. Volume
24, 513-523. Publication date: 1988 ("Salton reference").

BACKGROUND OF THE INVENTION 

The present invention relates to information management and,
more particularly, to the management and analysis of document information
text.

We live in the information age. How prophetic the statement
of a major computer manufacturer which said "It was supposed to be the
atomic age, instead it has turned out to be the information age."
Prophetic both in the impact of the age and in its potential for
beneficial and deleterious effects on humankind. Faced with an explosion
of information fueled by the burgeoning technologies of networking,
internetworking and computing, and by the trends of globalization and
decentralization of power, today's business managers, technical
professionals and investment managers need careful, accurate and timely
analysis of the deluge of information underlying their everyday decisions.
Several factors underlie this need for prompt information analysis. First,
in an era of ever tighter cost controls and budgetary constraints,
companies are faced with a need to increase their operational efficiency.
In doing so, they must assimilate large amounts of accounting and
financial information concerning both their internal functioning and their
position in the marketplace. Second, litigation is an omnipresent factor
that may cost or earn a company billions of dollars. The
outcome of such contests is often determined by which side has access to
the most accurate information. Third, the drive for greater economies of
scale and cost efficiencies spurs mergers and acquisitions, especially in
high technology areas. The success of such activity is highly dependent
upon who has superior abilities to assimilate information. Fourth, the
explosive growth of technology in all areas, especially in biotechnology,
computing and finance, brings with it the need to access and comprehend
technical trends impacting the individual firm. Fifth, the globalization
of the marketplace in which today's business entities find themselves
brings with it the need to master information concerning a multiplicity of
market mechanisms in a multiplicity of native languages and legal systems.
Sixth, the decentralization of large industrial giants has led to the need
for greater cross-licensing of indigenous technologies, requiring that
companies discern precisely the quantity and kinds of technology being
cross-licensed.

Faced with the increasing importance of successful analysis of
a burgeoning information stockpile, today's business professional
needs, as never before, tools which not only find
information, but find the correct information, and which assist the user
in drawing conclusions and perceiving the meaning behind the information
resources discovered.

The most typical information analysis tool available today is
a database of text or images which is searched by a rudimentary search
engine. The user enters a search query consisting of specific key words
encoded in a boolean formalism. Often the notation is so complex that
trained librarians are needed to ensure that the formula is correct. The
result of a database search is a list of documents containing the key
words the user has requested. The user often does not know the closeness
of the match until each reference cited by the search engine is studied
manually. There is often no way to search different portions of
documents. Finally, the output of this process is a flat amalgam of
documents which has not been analyzed or understood by the system
performing the search.

The user who turns to an automated information analysis system
is seeking not merely a collection of related documents, but the answers
to critical questions. For example,

"Are there any issued patents that are so close to this
invention proposal that they might represent a potential infringement
problem?"

"Are the resources of company X complementary to our own
company such that we should consider a merger with company X?"

"Of the court cases decided in California last year, how many
of them involved a sexual harassment charge?"

"What companies exist as potential competitors in the
marketplace for our planned product?"

Current analysis tools prove ineffective when faced with these
types of issues. What is needed is an information analysis tool capable
of acquiring, analyzing and comprehending a large amount of information
and presenting that information to users in an intelligible way.



SUMMARY OF THE INVENTION 

The present invention provides an interactive
document-management-and-analysis system and method for analyzing and
displaying information contained in a plurality of documents. Particular
embodiments of the invention are especially effective for analyzing patent
texts, such as patent claims, abstracts, and other portions of the
specification.

A method according to one embodiment of the invention includes 
generating a set of N different representations of each document, and for 
each of a number of selected pairs of documents, determining N utility 
measures, a given utility measure being based on one of the N 
representations of the documents in that pair. In a specific embodiment, 
this information is displayed as a scatter plot in an area bounded by N 
non-parallel axes, where each selected pair is represented by a point in 
N-space having its coordinates along the N axes equal to the N utility
measures.

In a specific embodiment, wherein N=2, the first 
representation is a conceptual-level representation such as a subject
vector, and the second representation is a term-based representation such 
as a word vector. 

In one use scenario, the selected pairs include all pairwise
combinations of the documents in the plurality. In another scenario, the
selected pairs are all pairwise combinations that include a particular
document in the plurality.

The use of multiple methods of analysis, such as, for example,
word-vector analysis and semantic-thread analysis, creates synergistic
benefits by providing multiple independent measures of similarity. A
system which uses multiple methods together can discover similar documents
that either method alone may have overlooked. In the cases where both
methods agree, the user has greater confidence in the results because of
the built-in "second opinion".
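
As a minimal sketch of the N=2 case described above (not the system's actual
implementation), the following Python gives every pair of documents a point
whose coordinates are a subject-vector score and a word-vector score. The
function names, the toy vectors and the choice of cosine similarity as the
utility measure are illustrative assumptions.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def pair_points(docs):
    """Map each document pair to (subject score, word score).

    `docs` maps a document id to a (subject_vector, word_vector) tuple,
    i.e. the N=2 representations of that document.
    """
    points = {}
    for a, b in combinations(sorted(docs), 2):
        points[(a, b)] = tuple(
            cosine(docs[a][i], docs[b][i]) for i in range(2)
        )
    return points

# Hypothetical toy data: subject proportions and raw word counts.
docs = {
    "A": ({"optics": 0.7, "polymers": 0.3}, {"laser": 2, "lens": 1}),
    "B": ({"optics": 0.7, "polymers": 0.3}, {"beam": 2, "mirror": 1}),
    "C": ({"finance": 1.0}, {"loan": 3}),
}
points = pair_points(docs)
```

In this toy data, documents A and B share no words but have the same subject
proportions, so the pair scores high on the subject axis and zero on the word
axis: the kind of match a single method could overlook.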

In accordance with another aspect of the invention, a dynamic
concept query is performed by treating a user-specified query as a special
type of document. The user can enter a list of words ranging from a
single keyword to the text of an entire document, which is treated as a
new document. A multidimensional array of similarity scores comparing
that document to each existing document in the set is calculated. The
user can then view the resulting clusters using the visualization
techniques described herein.
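
The idea of treating a query as just another document can be sketched as
follows. This is an illustration under stated assumptions, not the system's
code: `SUBJECTS` is a hypothetical hand-made lexicon standing in for a real
subject coder, and `concept_query`, `word_vector` and `subject_vector` are
names invented for the example.

```python
import math
from collections import Counter

# Hypothetical word-to-subject lexicon standing in for a real subject coder.
SUBJECTS = {"chip": "electronics", "solder": "electronics",
            "loan": "finance", "bank": "finance"}

def word_vector(text):
    """Term-based representation: lower-cased token -> count."""
    return Counter(text.lower().split())

def subject_vector(text):
    """Conceptual representation: subject code -> count of tagged words."""
    return Counter(SUBJECTS[t] for t in text.lower().split() if t in SUBJECTS)

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def concept_query(query_text, corpus,
                  representations=(subject_vector, word_vector)):
    """Score each document against the query, treated as a new document.

    Returns one tuple of scores per document, one score per representation.
    """
    scores = {}
    for doc_id, text in corpus.items():
        scores[doc_id] = tuple(
            cosine(rep(query_text), rep(text)) for rep in representations
        )
    return scores

corpus = {"p1": "solder bump chip package", "p2": "bank loan interest"}
scores = concept_query("chip carrier", corpus)
```

Because the query passes through the same representation functions as the
stored documents, its scores can be viewed with the same techniques as any
document-to-document comparison.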

The invention provides for an innovative analysis tool that 
assists users in discovering relationships among thousands of documents 
such as patents. Sophisticated natural language and information retrieval 
techniques enable the user to analyze claim sets, cluster claims based on 
similarity, and navigate through the results using graphical and textual
visualization. 

The invention provides further for a search routine which goes
beyond simple keyword search; it understands the structure of documents
such as patents and it captures concepts like patent infringement and
interference. Users can browse through data visualizations (e.g., range
query as described below), inspect quantitative score comparisons, and
perform side-by-side textual analysis of matching patent claims. Based on
the information gathered, users may analyze competitive patent and
acquisition portfolios, develop patent blocking strategies, and find
potential patent infringement.

In accordance with another aspect of the invention, the
analysis methods described herein may be applied to a set of documents
formed by the additional step of filtering a larger set of documents based
on a concept query. For example, a user may find it useful to examine
only those patents that discuss microelectronic packaging, analyze those
patents, and generate a scatter plot display (e.g., run a concept query to
pick a claim of interest followed by a claim query based on such claim and
generate an overlay plot as described below).

In accordance with another aspect of the invention,
recognizing and exploiting the relationships among various document types
and "compound documents" permits multi-faceted analyses
of multiple document types. For example, a patent is a compound document
with nested sub-document linkages to sub-components, such as claims,
background and summary of invention, etc. A claim is also a compound
document because it may refer to other claims. Applying this paradigm to
document analysis strategies, claims, whether individual, nested or as an
amalgam, may be compared to other compound document components, for
example by comparing claims to the background and summary of invention in
a patent. Furthermore, claims and background and summary of invention can
be compared to other documents, such as related prior art literature from
other sources such as magazines or journals. This enables the patent
practitioner to view relevant claims, background and summaries, and other
documents (non-patents), and cluster these together by similarity
measures.

In accordance with one aspect of the invention, the user may
select a metric that captures the essence of the document or documents
under analysis. For example, the legal concept of patent infringement may
be applied to sets of patents or patent applications. In a particular
embodiment, a similarity matching algorithm treats the exemplar part of a
patent claim differently from the dependent parts of the claim. Thus, a
kind of "cross-comparison" matching is used, wherein the combined scores
for (1) patent A, claim X dependent and independent part(s) vs. patent B,
claim Y, independent part and (2) patent A, claim X dependent and
independent part(s) vs. patent B, claim Y, dependent and independent
part(s), generate an aggregate matching (or similarity) score for patent
A, claim X vs. patent B, claim Y.

Normalization techniques deal with asymmetries in the
matching, especially for documents of different lengths. For example, in
the patent context, consider a short claim on "blue
paint" and a long claim containing "blue paint." Comparing the short
claim to the long claim yields a close match (since the long one at least
contains the short one). But what of the case where it's the long claim
vs. the short one? Standard information retrieval techniques would
dictate that it's a poor match, since the long claim contains many
limitations not in the short one. For patents, the
"interference/infringement" match suggests that these are close, because
if one "covers" the other, it doesn't matter which is the "query" and
which is the "document."
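
The asymmetry in the "blue paint" example can be made concrete with a small
sketch. The text does not give a normalization formula, so the choice below
(take the larger of the two directional containment scores) is purely an
assumption for illustration, as are the function names and the sample claim
wording.

```python
def containment(query_terms, doc_terms):
    """Fraction of the distinct query terms that also occur in the document."""
    query_terms, doc_terms = set(query_terms), set(doc_terms)
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)

def coverage_score(claim_a, claim_b):
    """Order-independent score: high if either claim covers the other.

    Assumption: the maximum of the two directional scores is used, so it
    does not matter which claim plays the "query" and which the "document".
    """
    return max(containment(claim_a, claim_b), containment(claim_b, claim_a))

# Hypothetical claim wording, split into terms on white space.
short_claim = "blue paint".split()
long_claim = "a can containing blue paint and a brush attached to the lid".split()

short_vs_long = containment(short_claim, long_claim)   # short claim as query
long_vs_short = containment(long_claim, short_claim)   # long claim as query
symmetric = coverage_score(long_claim, short_claim)
```

Here the short claim is fully contained in the long one, while the long claim
matches the short one poorly in the other direction; the symmetric score
reports the pair as close in either order.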

Similarity based on the legal concept of patent infringement
and interference serves as the touchstone to analyze, cluster and
visualize patents and applications. This enables users to evaluate
incoming applications for infringement against existing patents, filter
large sets of patents to remove reissued and derivative patents, identify
significant claim modifications in a reissued patent and identify related
and unrelated patents to compare the intellectual property of two
businesses.

A further understanding of the nature and advantages of the 
present invention may be realized by reference to the remaining portions 
of the specification and the drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1A is a block diagram of a document analysis system
embodying the present invention;

Fig. 1B is a more detailed block diagram of the interactions
between the user and the system during the processing of document 
information; 

Figs. 2A-2B depict off line structured document processing 
steps according to a particular embodiment; 

Fig. 3 depicts the preprocess step of off line structured 
document processing of Figs. 2A-2B according to a particular embodiment; 

Fig. 4A depicts the mapit-process step of off line structured
document processing of Figs. 2A-2B according to a particular embodiment; 

Fig. 4B depicts the mapit-sfc step of Fig. 4A according to a 
particular embodiment; 

Fig. 5 depicts the on line concept query processing according 
to a particular embodiment; 

Figs. 6A-6B depict off line generic document processing steps 
according to a particular embodiment; 

Fig. 7A depicts claim parsing according to a particular
embodiment;

Fig. 7B depicts the process-words step of claim parsing of 
Fig. 7A according to a particular embodiment; 

Fig. 8A illustrates a scatter plot visualization technique 
according to a particular embodiment of the invention; 

Fig. 8B illustrates a 2D plot visualization technique 
according to a particular embodiment of the invention; 

Fig. 8C illustrates a 3D plot visualization technique 
according to a particular embodiment of the invention; 

Fig. 8D illustrates an S-curve plot visualization technique 
according to a particular embodiment of the invention; 

Fig. 8E is a flow chart depicting the steps for generating an 
S-curve plot; 

Fig. 9A illustrates a representative sign on screen according 
to a particular embodiment of the invention; 

Fig. 9B illustrates a representative dataset select screen 
according to a particular embodiment of the invention; 

Fig. 9C illustrates a representative concept query screen 
according to a particular embodiment of the invention; 

Fig. 9D illustrates a representative concept query review 
screen according to a particular embodiment of the invention; 

Figs. 9E and 9F illustrate representative concept query 
results screens according to a particular embodiment of the invention; 

Fig. 9G illustrates a representative concept query results
viewer screen according to a particular embodiment of the invention; 

Figs. 9H and 9I illustrate representative concept query
results viewer screens depicting side-by-side comparisons according to a 
particular embodiment of the invention; 

Fig. 9J illustrates a representative claim viewer screen
according to a particular embodiment of the invention;

Fig. 9K illustrates a representative patent viewer screen 
according to a particular embodiment of the invention; 

Fig. 10A illustrates a representative patent query screen
according to a particular embodiment of the invention; 

Fig. 10B illustrates a representative patent query results
screen according to a particular embodiment of the invention; 

Fig. 10C illustrates a representative patent query side-by-side
comparison screen of claims according to a particular embodiment of
the invention; 

Fig. 10D illustrates a representative patent query side-by-side
comparison screen of patents according to a particular embodiment of
the invention; 

Fig. 11A illustrates a representative claim query screen
according to a particular embodiment of the invention; 

Fig. 11B illustrates a representative claim query claim
finding screen according to a particular embodiment of the invention; 

Fig. 11C illustrates a representative claim query results
screen according to a particular embodiment of the invention; 

Fig. 11D illustrates a representative claim query side-by-side
comparison screen of claims according to a particular embodiment of the 
invention; 

Fig. 11E illustrates a representative overlay plot for a claim
query results screen according to a particular embodiment of the 
invention; 

Fig. 11F is a flow chart depicting the steps for generating
an overlay plot; 

Fig. 12A illustrates a representative range query screen 
according to a particular embodiment of the invention; 

Fig. 12B illustrates a representative range query results 
screen according to a particular embodiment of the invention; 

Fig. 12C is a flow chart depicting the steps for generating a 
range query; and 

Figs. 13A-13G illustrate an alternative embodiment of the 
present invention. 



DESCRIPTION OF SPECIFIC EMBODIMENTS 
A preferable embodiment of a document-management-and-analysis 
system and method according to the invention applicable to the task of 
patent search and analysis is reduced to practice and is available under 
the trade name, MAPIT™. 

A document search and analysis tool must be both fast enough 
to handle a voluminous quantity of documents and flexible enough to adapt 
to different user requirements. Other aspects of the invention are of 
particular importance to expedient, accurate and efficient document 
analysis. First, the understanding of the structure and content of 
documents on multiple levels is useful to provide a much deeper analysis
than generic search engines known in the art. Second, the combination of 
multiple similarity metrics is useful to achieve highly customized 
results. By contrast, search engines known in the art restrict the user 
to whatever notion of similarity was incorporated into the system by its 
designers. Third, the ability to parse structured documents, such as 
patent claims, is useful to extract their meaning. Fourth, the graphical 
display of information relevant to the user provides the user with quick 
access to the product of the analysis. 

In accordance with the invention, multiple forms of textual 
analysis used to compare documents may be combined in any particular 
embodiment. One textual analysis method, called word-vector (also
referred to as "wordvec" and "term-based") analysis, focuses on the
co-occurrences of individual words and phrases between documents under
analysis. Stemming technology, used in conjunction with word-vector
analysis, matches words such as "projected" and "projection". Noun
phrases, which are the key building blocks for many documents, such as
patents, are identified and isolated. Those technologies are not
restricted to English, but can be applied directly to other European
languages. Word-vector analysis may be reduced to practice using Term
Frequency/Inverse Document Frequency ("TF.IDF") techniques, or other
techniques known in the art. TF.IDF techniques are further described in
the Salton reference identified above.
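
For readers unfamiliar with TF.IDF, a minimal sketch follows. It uses one
common weighting variant, term frequency times log(N / document frequency);
the Salton reference discusses several variants, and nothing here is claimed
to be the weighting used by the system. Stemming and noun-phrase detection
are omitted, and the function name and sample texts are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Hypothetical three-document collection.
docs = [
    "a laser projection display",
    "a holographic projection screen",
    "a polymer coating process",
]
vecs = tfidf_vectors(docs)
```

A term that occurs in every document ("a") receives weight zero, a term
shared by two documents ("projection") receives a small weight, and a term
unique to one document ("laser") receives the largest weight.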

Another form of textual analysis is called semantic thread
analysis (also referred to herein as "subject-vector" or "conceptual
representation" analysis). Instead of focusing on individual words, this
method identifies the general topics and themes in a document. It can
determine that a patent, for example, is 35% about engineering physics,
15% about polymer science, 20% about holography, and 30% about
manufacturing processes. If two patents cover the same subject areas in
the same proportions, it is likely that they are closely related even if
they use completely different words to describe their inventions.
Semantic thread analysis may be reduced to practice by employing subject
field code (SFC) techniques described by Dr. Elizabeth Liddy, et al. in
one or more of the references identified above.
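
The proportions in the example above can be illustrated with a short sketch.
It assumes the subject codes for each content-bearing word have already been
assigned; the SFC process that disambiguates and assigns them is not
reproduced here, and the code labels, counts and function names are
hypothetical.

```python
import math
from collections import Counter

def subject_vector(subject_codes):
    """Turn a list of per-word subject codes into proportions summing to 1."""
    counts = Counter(subject_codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# 20 tagged words: 35% physics, 15% polymers, 20% holography, 30% manufacturing.
patent_1 = subject_vector(["physics"] * 7 + ["polymers"] * 3
                          + ["holography"] * 4 + ["manufacturing"] * 6)
# A longer patent with the same subject mix (different words, same codes).
patent_2 = subject_vector(["physics"] * 14 + ["polymers"] * 6
                          + ["holography"] * 8 + ["manufacturing"] * 12)
# An unrelated patent.
patent_3 = subject_vector(["finance"] * 10)

same_mix = cosine(patent_1, patent_2)
unrelated = cosine(patent_1, patent_3)
```

Since only the subject proportions are compared, the two patents with the
same mix score as near-identical regardless of length or vocabulary.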

Preface on the Format of the Drawings 

Embodiments of the invention will be best understood with 
reference to the drawings included herewith. A note on the format of 
these drawings is in order. In the drawings, process steps are depicted 
as squares or rectangles. Data structures internal to the program are 
depicted as rhomboid type structures. For example, in reference to Fig.
2A, element 10, text format file of patents in a search set is a rhomboid
structure. Conventional data or text files are depicted as squares or
rectangles with the upper right hand corner turned downward. For example,
in Fig. 2A element 50 justclaims is a file which may exist on a hard disk,
floppy disk, CD ROM or other form of storage medium. Open ended arrows
reflect the flow of information. Tailless arrows indicate the flow of
processing.

These drawings depict the processing steps, files and
information according to one embodiment of the invention targeted to
processing and understanding patents. While this serves as an excellent
example of the features of the invention, the reader with ordinary skill
in the art will appreciate that the invention's scope encompasses not
merely the understanding and analysis of patents, but other documents as
well.

Table 1 provides a definitional list of terminology used
herein.



Term                Definition

Claim Query         A query against a collection of text documents compared
                    to a part of a particular member of the collection.

Concept Query       A query against a collection of text documents compared
                    to a user input textual concept.

Corpus              A dataset.

Dataset             A document database containing documents upon which
                    search and analysis operations are conducted.

Document            A unit of text which is selected for analysis which may
                    include an entire document or any portion thereof such
                    as a title, an abstract, or one or more clauses,
                    sentences, or paragraphs. A document will typically be
                    a member of a document database containing a large
                    number of documents and may be referred to by the term
                    corpus.

DR LINK             Document Retrieval using Linguistic Knowledge. This is
                    a system for performing natural language processing.
                    This system is described in papers by Dr. Liddy
                    referenced in the cross-reference section herein above.

Patent Query        A query against a collection of text documents compared
                    to a particular member of the collection, identified by
                    the user.

Polysemy            The ability of a word to have multiple meanings.

Query               Text that is input for the purpose of selecting a subset
                    of documents from a document database. While most
                    queries entered by a user tend to be short compared to
                    most documents stored in a database, this should not be
                    assumed.

score               A numerical indicator assigned to a document indicative
                    of a particular characteristic, e.g. relevance to a
                    query.

Searchset           A document database containing documents upon which
                    search and analysis operations are conducted.

SFC                 Subject field coder. A subject field coder is a process
                    which tags content-bearing words in a text with a
                    disambiguated subject code using a lexical resource of
                    words which are grouped in subject categories.

SGML                Standard Generalized Markup Language. Standard
                    generalized markup language is comprised of a set of
                    tags which may be embedded into a text document to
                    indicate to a text processor how to process the
                    surrounding or encompassed text.

Split Dataset       A dataset may be split into two distinct components in
                    order to perform comparative analyses between the two
                    sub-datasets. For example, a split dataset of A company
                    patents and B company patents enables the user to
                    discover relationships between the patent portfolios of
                    the two companies.

Stemming            Stemming is a process whereby nouns are reduced to their
                    most basic form or stem. For example, the words
                    "processing" and "processed" are stemmed to the word
                    "process".

Stop Word           One of a collection of words which are not assigned a
                    semantic meaning by the system. For example, the word
                    "the".

Stop Word List      A list of stop words.

Term Index          A unique identifier assigned to each stem by a term
                    indexer.

Term Indexer        Term indexer is a process which performs indexing on an
                    input text. Indexing involves extracting terms from the
                    text, checking for stop words, processing hyphenated
                    words, then stemming all inflected terms to a standard
                    form. Finally, a unique term index is assigned to each
                    stem.

TFIDF               Term Frequency/Inverse Document Frequency. This is a
                    score computed by a term indexer process. This score
                    determines the relative prominence of a term compared to
                    its occurrence throughout a document body.

Token               A white space delimited sequence of characters having a
                    particular meaning.

Tokenize            A process whereby input text is separated into a
                    collection of tokens.

Transitive Closure  The transitive closure of a claim is the claim itself
                    and the transitive closure of all references within the
                    particular claim.

weight              A numerical indicator assigned to a word or token
                    indicative of a particular characteristic, e.g.
                    relevance to a query.

Word                A single word, compound word, phrase or multiword
                    construct. Note that the terms "word" and "term" are
                    used interchangeably. Terms and words include, for
                    example, nouns, proper nouns, complex nominals, noun
                    phrases, verbs, adverbs, numeric expressions and
                    adjectives. These include stemmed and non-stemmed
                    forms.
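
The recursive definition of the transitive closure of a claim can be sketched
as follows. This is an illustration only: the regular expression is a
simplifying assumption that picks up the first number after "claim" or
"claims" (so "claims 1 and 2" yields only 1), and the function names and
sample claims are hypothetical.

```python
import re

def claim_references(text):
    """Claim numbers cited in a claim, e.g. 'of claim 1' -> {1}."""
    return {int(n) for n in
            re.findall(r"\bclaims?\s+(\d+)", text, flags=re.IGNORECASE)}

def transitive_closure(claim_no, claims, _seen=None):
    """The claim itself plus the closure of every claim it references."""
    seen = set() if _seen is None else _seen
    # Stop on claims already visited or not present in the claim set.
    if claim_no in seen or claim_no not in claims:
        return seen
    seen.add(claim_no)
    for ref in claim_references(claims[claim_no]):
        transitive_closure(ref, claims, seen)
    return seen

# Hypothetical claim set keyed by claim number.
claims = {
    1: "A display comprising a laser.",
    2: "The display of claim 1, further comprising a lens.",
    3: "The display of claim 2, wherein the lens is polymer.",
    4: "A method of coating a substrate.",
}
closure_of_3 = transitive_closure(3, claims)
```

Claim 3 depends on claim 2, which depends on claim 1, so its closure contains
all three; an independent claim's closure is the claim alone.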
Hardware Overview

The document-management-and-analysis system (the "system") of
the present invention is implemented in the "C", "C++", "perl" and UNIX
shell script programming languages and is operational on a computer system
such as shown in Fig. 1A. This figure shows a conventional client-server
computer system 1 that includes a server 20 and numerous clients, one of
which is shown at 25. The term "server" is used in the context
of the invention, where the server receives queries from (typically
remote) clients, does substantially all the processing necessary to
formulate responses to the queries, and provides these responses to the
clients. However, server 20 may itself act in the capacity of a client
when it accesses remote databases located on a database server.
Furthermore, while a client-server configuration is known, the invention
may be implemented as a standalone facility, in which case client 25 would
be absent from the figure.

The hardware configurations are in general standard, and will
be described only briefly. In accordance with known practice, server 20
includes one or more processors 30 that communicate with a number of
peripheral devices via a bus subsystem 32. These peripheral devices
typically include a storage subsystem 35 (memory subsystem and file
storage subsystem holding computer program (e.g., code or instructions)
and data implementing the document-management-and-analysis system), a set
of user interface input and output devices 37, and an interface to outside
networks, including the public switched telephone network. This interface
is shown schematically as a "Modems and Network Interface" block 40, and
is coupled to corresponding interface devices in client computers via a
network connection 45.

Client 25 has the same general configuration, although
typically with less storage and processing capability. Thus, while the
client computer could be a terminal or a low-end personal computer, the
server computer would generally need to be a high-end workstation or
mainframe, such as a SUN SPARC server. Corresponding elements and
subsystems in the client computer are shown with corresponding, but
primed, reference numerals.

The user interface input devices typically include a keyboard
and may further include a pointing device and a scanner. The pointing
device may be an indirect pointing device such as a mouse, trackball,
touchpad, or graphics tablet, or a direct pointing device such as a
touchscreen incorporated into the display. Other types of user interface
input devices, such as voice recognition systems, are also possible.

The user interface output devices typically include a printer
and a display subsystem, which includes a display controller and a display
device coupled to the controller. The display device may be a cathode ray
tube (CRT), a flat-panel device such as a liquid crystal display (LCD), or
a projection device. The display controller provides control signals to the
display device and normally includes a display memory for storing the
pixels that appear on the display device. The display subsystem may also
provide non-visual display such as audio output.

The memory subsystem typically includes a number of memories
including a main random access memory (RAM) for storage of instructions
and data during program execution and a read only memory (ROM) in which
fixed instructions are stored. In the case of Macintosh-compatible
personal computers the ROM would include portions of the operating system;
in the case of IBM-compatible personal computers, this would include the
BIOS (basic input/output system).

The file storage subsystem provides persistent (non-volatile)
storage for program and data files, and typically includes at least one
hard disk drive and at least one floppy disk drive (with associated
removable media). There may also be other devices such as a CD-ROM drive
and optical drives (all with their associated removable media).
Additionally, the computer system may include drives of the type with
removable media cartridges. The removable media cartridges may, for
example, be hard disk cartridges, such as those marketed by Syquest and
others, and flexible disk cartridges, such as those marketed by Iomega.
One or more of the drives may be located at a remote location, such as in a
server on a local area network or at a site of the Internet's World Wide
Web.

In this context, the term "bus subsystem" is used generically
so as to include any mechanism for letting the various components and
subsystems communicate with each other as intended. With the exception of
the input devices and the display, the other components need not be at the
same physical location. Thus, for example, portions of the file storage
system could be connected via various local-area or wide-area network
media, including telephone lines. Similarly, the input devices and
display need not be at the same location as the processor, although it is
anticipated that the present invention will most often be implemented in
the context of PCs and workstations.

Bus subsystem 32 is shown schematically as a single bus, but a
typical system has a number of buses such as a local bus and one or more
expansion buses (e.g., ADB, SCSI, ISA, EISA, MCA, NuBus, or PCI), as well
as serial and parallel ports. Network connections are usually established
through a device such as a network adapter on one of these expansion buses
or a modem on a serial port. The client computer may be a desktop system
or a portable system.

The user interacts with the system using interface devices 37'
(or devices 37 in a standalone system). For example, client queries are
entered via a keyboard, communicated to client processor 30', and thence
to modem or network interface 40' over bus subsystem 32'. The query is
then communicated to server 20 via network connection 45. Similarly,
results of the query are communicated from the server to the client via
network connection 45 for output on one of devices 37' (say a display or a
printer), or may be stored on storage subsystem 35'.

Fig. 1B is a functional diagram of computer system 1. Fig. 1B
depicts a server 20 preferably running Sun Solaris software or its
equivalent, and a representative client 25 of a multiplicity of clients
which may interact with the server 20 via the internet 45 or any other
communications method. Blocks to the right of the server are indicative
of the processing steps and functions which occur in the server's program
and data storage indicated by block 35 in Fig. 1A. Input search set 10,
which in this embodiment is a text format file of patents to be searched,
serves as the input to query processing block 35A. Query processing
manipulates the input data 10 to yield a searchable dataset 10A. A Common
Gateway Interface (CGI) script 35B enables queries from user clients to
operate upon the dataset 10A and returns responses to those queries, drawn
from the information in the dataset 10A, to the clients in the form of
Hypertext Markup Language (HTML) document outputs, which are then
communicated via internet 45 back to the user.

Client 25 in Fig. 1B possesses software implementing the
function of a web browser 35A" and an operating system 35B". The user of
the client may interact via the web browser 35A" with the system to make
queries of the server 20 via internet 45 and to view responses from the
server 20 via internet 45 on the web browser 35A".

In accordance with one aspect of the invention, documents may
be thought of as belonging to two broad categories. The first are
structured documents, i.e., those having a highly specific structure. For
example, the claims incorporated within patents are structured. To
accurately compare claims from two different patents, it is necessary to
realize that a claim may refer to earlier claims, and those earlier claims
must enter into the analysis. Furthermore, for purposes of infringement
analysis, it is important to treat the "head" of a chain of dependent
claims differently from the rest of the body. The second type of document
is a more generic form of document having no definable structural
components; such documents are referred to as generic documents.

In a particular embodiment, the invention divides overall
processing into an off-line processing step and an on-line query step.
The off-line processing step will process incoming document information
from a variety of input sources, such as a database of U.S. patents, a
collection of documents scanned into electronic format by a scanner or a
database of newsprint, and build from it structures which allow the system
to manipulate and interpret the data acquired from these input
sources. The query step, on the other hand, is targeted to on-line
interactions with the system to gain from it knowledge about information
which the off-line step has processed.

Off-line processing

Fig. 2A depicts off-line processing of structured documents
(or claims in this example) in this particular embodiment of the
invention. A text format file of patents, search set 10, comprises the
input data to the system. Input data may be in multiple formats.

Claim processing for a data set begins with the step of
creating a justclaims file 50 for each patent in the set, pursuant to step
102 of Fig. 2A. Each file 50 contains the text of all the claims of one
patent disposed within the set. The reader of ordinary skill in the art
will appreciate that the specific processing of this step necessarily
conforms to the format of the source text available to the system. For
example, if the source text is in text format, this step must process
textual data. Next a justclaimslist 52 is produced in step 104. The
justclaimslist contains the full directory path to each justclaims file 50
in the order that they are processed.

Pursuant to step 106, a make-claims routine is executed. This
make-claims routine takes each of the justclaims files 50 created in step
102 and one line from justclaimslist 52 and creates two separate files for
each claim contained in file 50 (and therefore in a patent). The first
file, called single file 54, contains the text of one claim. The second
file, called merged file 56, contains the text of one claim plus the text
of the transitive closure of all claims referenced by that claim. The
output from make-claims step 106 also includes a claimlist data structure
12 and a patentlist data structure 14 (in the form of conventional binary
data files). The make-claims step employs numerous heuristics in an
attempt to identify both the scope and references of the claims. For
example: 1) Each claim must start on a new line and that line must start
with the claim number, a period and one or more spaces; 2) Claims must
be numbered sequentially starting with 1 (note that this heuristic will
not catch the case where, for example, claim 4 has text including a line
starting with "5."); 3) References to other claims are understood by the
system, such as: a) "claim 3", b) "Claim 3", c) "claim 2 or 3", d)
"claim 2 and 3", e) "claims 2 or 3", f) "claims 2 and 3", g) "claims 2,
3, or 4", h) "claims 2-4", i) "claims 2 to 4", j) "claims 2 through 4",
k) "claims 2-5 inclusive or 8", l) "all previous claims", m) "any
preceding claims"; and 4) Claims can only refer to claims occurring
previously in the document. It is possible but rare to legally refer to a
future claim. It is rather common to have a typographic error refer to a
future claim by mistake. If a reference to a future claim is encountered,
a warning message is printed, the reference is skipped and processing
continues. (This warning is forwarded to the user, who determines whether
the reference to a future claim is intentional or a typographical error.)
All claims referred to by the current claim, and all claims recursively
referred to by any of them, are printed in the order encountered following
the text of the current claim. The remaining heuristics are specified in
Table 3 below.
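
By way of illustration only, the reference heuristics and the transitive closure used for the merged file might be sketched as follows. The function names (expand_numbers, direct_refs, merged_text) and the regular expressions are hypothetical and are not taken from the make-claims routine itself; the patterns cover only some of the reference forms listed above (for example, "claims 2-5 inclusive or 8" is only partly recognized).

```python
import re

# Hypothetical patterns: "claim 3", "claims 2 or 3", "claims 2, 3, or 4",
# "claims 2-4", "claims 2 to 4", "claims 2 through 4".
REF_RE = re.compile(
    r"\bclaims?\s+(\d[\d\s,\-]*(?:(?:to|through|or|and)[\d\s,\-]+)*)",
    re.IGNORECASE,
)
# "all previous claims" / "any preceding claims"
ALL_PREVIOUS_RE = re.compile(
    r"\b(?:all|any)\s+(?:previous|preceding)\s+claims?\b", re.IGNORECASE
)


def expand_numbers(span):
    """Turn a matched span such as '2-4' or '2, 3, or 4' into claim numbers."""
    tokens = re.findall(r"\d+|-|to|through", span, re.IGNORECASE)
    numbers, range_open = [], False
    for tok in tokens:
        if tok.isdigit():
            n = int(tok)
            if range_open and numbers:
                numbers.extend(range(numbers[-1] + 1, n + 1))
            else:
                numbers.append(n)
            range_open = False
        else:
            range_open = True  # '-', 'to' or 'through' opens a range
    return numbers


def direct_refs(claim_no, text, warnings=None):
    """Claims referenced directly by one claim, in order, without duplicates."""
    found = []
    for match in REF_RE.finditer(text):
        found.extend(expand_numbers(match.group(1)))
    if ALL_PREVIOUS_RE.search(text):
        found.extend(range(1, claim_no))
    refs = []
    for n in found:
        if n >= claim_no:  # reference to a future claim: warn and skip
            if warnings is not None:
                warnings.append(f"claim {claim_no} refers to future claim {n}")
            continue
        if n not in refs:
            refs.append(n)
    return refs


def merged_text(claim_no, claims):
    """Text of one claim followed by its transitive closure of references."""
    order, seen = [], {claim_no}

    def visit(n):
        for ref in direct_refs(n, claims[n]):
            if ref in claims and ref not in seen:
                seen.add(ref)
                order.append(ref)
                visit(ref)

    visit(claim_no)
    return "\n".join([claims[claim_no]] + [claims[n] for n in order])
```

In this sketch, `claims` is assumed to be a dictionary from claim number to claim text; a single file would hold `claims[n]` alone and a merged file would hold `merged_text(n, claims)`.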

Preprocess step 108 has the task of taking raw document input,
filtering from it extraneous matter and extracting root words and noun
phrases. A commercial-off-the-shelf (COTS) language processing tool such
as the XLT software package available from Xerox, a corporation with
headquarters in Stamford, Connecticut, performs much of the processing,
although other software, such as a part-of-speech (POS) tagger of the type
provided by such companies as Inso Corporation, Boston, Massachusetts, may
also be used. The behavior of preprocess step 108 in this embodiment is
depicted with greater particularity in Fig. 3.

Referring to flowchart 201 of Fig. 3, the preprocess step 
initially prefilters input text and removes nonlegible items, pursuant to 
step 200. Any number of appropriate heuristics may be used, such as 
dropping any words with more than fifty characters. 

Next, a tokenize step 210 tokenizes the document text.
Following step 210, all words are converted to lower case pursuant to step
220. Each word is then reduced to its derivational root in stem step 230.
For example, the words "processed" and "processing" would be stemmed to
the word "process". Stems are written out in step 240 with two
exceptions: 1) If the stemmed word contains anything except letters, it
is not printed and 2) If the original word is contained in the stop word
list, it is not printed.
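
A minimal sketch of steps 200 through 240 is given below for illustration. It assumes a whitespace tokenizer, a toy suffix-stripping stemmer standing in for the commercial language processing tool, and a tiny stop word list; none of these (nor the names toy_stem, write_stems, STOP_WORDS) comes from the described embodiment.

```python
# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is"}


def toy_stem(word):
    """Toy stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if suffix == "s" and word.endswith("ss"):
            continue  # keep words such as "process" intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def write_stems(text, max_len=50):
    """Return the stems that would be written out for one document."""
    stems = []
    for token in text.split():                 # step 210: tokenize (assumed)
        token = token.strip(".,;:()\"'")
        if not token or len(token) > max_len:  # step 200: drop overlong items
            continue
        word = token.lower()                   # step 220: lower case
        stem = toy_stem(word)                  # step 230: stem
        if not stem.isalpha():                 # exception 1: letters only
            continue
        if word in STOP_WORDS:                 # exception 2: stop words
            continue
        stems.append(stem)                     # step 240: write out
    return stems
```

With this toy stemmer both "processed" and "processing" reduce to "process", matching the example above; a production stemmer would of course handle far more cases.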

The next step is tag words step 250. All words are tagged
with their part of speech working one sentence at a time. If the sentence
has more than 1,000 tokens, the program will skip this sentence. Post
filter step 260 removes phrases suspected of being in error. For example,
noun phrases with more than five nouns in a row are removed. I.D. noun
phrases step 270 removes extraneous noun phrases. For example, if the phrase
contains the word "said", the phrase is removed. If the phrase contains
the word "claim" or "claims", the phrase is removed. Additional
extraneous terms related specifically to the subject technology may also
be identified and removed.

In write-out noun phrases step 280, all noun phrases are
written to the standard output on a single line separated by a space. The
words in the phrase are joined by an underscore ("_"). In summary,
preprocess step 108 produces a single file for each input document
containing the foregoing subject matter ("preprocessing text file"),
representing preprocessed documents for subsequent analysis.

Referring again to Fig. 2A, after preprocess step 108
completes, processing continues with a build-claimlist step 110.
Build-claimlist translates the full directory paths of the justclaims files
50, represented in the justclaimslist file 52 as ASCII directory paths, into
a binary represented form of the directory path information stored in the
claimlist.bin file 49. This enables later processing to work with binary
represented full directory paths for these files, which is more efficient
than working with the text represented files.

Following step 110, a build-hash step 112 creates a series of
hash files 60 that enable the system to rapidly access information about
various documents and claims. Each hash file consists of two separate
files containing mapping information linking together information about
the documents being processed. These hash files 60 are: 1) A mapping
from a claim number to a unique document index, representative of each
document being analyzed; 2) A mapping from a claim number to the full
directory path to the text of that claim; 3) A mapping from a claim
number to the first 150 characters of the claim; 4) A mapping from a
patent number to the unique document index; 5) A mapping from a patent
number to a full directory path to the text of that patent; 6) A mapping
from a patent number to the full title of the patent; 7) A mapping from a
patent number to the assignee of that patent; and 8) A mapping from a
patent number to a space separated list of claims included in that patent.
Each hash file created in step 112 is a mapping between an ASCII key
string and an ASCII value string.

Following step 112, a fix-patentlist step 114 removes entries
from patentlist data structure 14 that do not have any claims. The
original patentlist is backed up to an original patentlist file 16. The
remaining good patents (i.e., those with claims) stay in patentlist
structure 14. Any bad patents are written to a separate data structure
18. Processing now continues with a mapit-process step 116 which is
described in greater detail in flowchart 301 of Fig. 4A.

Referring to flowchart 301, initially tf step 300 translates a
set of claimlist text files 12 which have been processed by the preprocess
step 108 in Fig. 2A into a single file 64, which consists of a list of
each unique term in the original claimlist files followed by a count of
the number of occurrences of that term for each document. This file is
the last ASCII represented file that is produced during processing.

Step 310 next takes the file 64 produced by step 300 and
creates four files 66 used in calculations in tfidf-all step 320.
Included in files 66 are: 1) A hash file mapping each term in the body of
documents being analyzed to a unique index; 2) A binary file containing a
single integer value for the number of words in the hash file; 3) A
binary version of the file created by step 300 recording the term index
and the frequency count for each term in each document; and 4) A mapping
of a unique index associated with each term to the number of documents
that contain that term. The total number of terms including duplicates in
each document is printed to standard output (STDOUT) and is typically
redirected to a convenient file.
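
The bookkeeping performed by steps 300 and 310 can be sketched, under assumptions, with in-memory dictionaries in place of the hash and binary files. The function names term_frequencies and index_terms are hypothetical, and the on-disk formats of files 64 and 66 are not modeled.

```python
from collections import Counter


def term_frequencies(docs):
    """Step 300 analogue: docs maps a document id to its list of stems."""
    return {doc_id: Counter(stems) for doc_id, stems in docs.items()}


def index_terms(tf):
    """Step 310 analogue: build the four structures plus per-document totals."""
    term_index = {}        # 1) term -> unique index
    binary_tf = {}         # 3) document -> {term index: frequency count}
    doc_freq = Counter()   # 4) term index -> number of documents with the term
    for doc_id, counts in tf.items():
        row = {}
        for term, count in counts.items():
            idx = term_index.setdefault(term, len(term_index))
            row[idx] = count
            doc_freq[idx] += 1
        binary_tf[doc_id] = row
    vocab_size = len(term_index)  # 2) number of words in the hash
    totals = {doc_id: sum(c.values()) for doc_id, c in tf.items()}
    return term_index, vocab_size, binary_tf, doc_freq, totals
```

For example, two claims with stems `["gear", "gear", "shaft"]` and `["gear", "motor"]` yield a three-term vocabulary in which "gear" has a document frequency of two.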

Referring again to Fig. 4A, step 320 calculates actual TFIDF
weights for each term in each document in the claimlist, producing a file
of weights 72. These weights are combined by mapit-all step 120 (Fig. 2B)
or mapit-process-query step 420 (Fig. 5) to generate a "score" for a pair
of documents. In a preferable embodiment two separate sets of weights are
calculated for each document. The first set, query weights, is to be used
when comparing the document against a concept query. The second set, doc
weights, is used when comparing a document against another document.
TF.IDF techniques for calculating doc weights are further described in
the Salton reference identified above.

Following tfidf-all step 320, a normalize step 330 calculates
a set of normalization factors that force all document-pair scores to lie
between 0.0 and 1.0. By definition a document compared against itself has
a perfect score of 1.0 and no other document can score higher than 1.0.
In a preferable embodiment, a document is scored against itself by
calculating the term weights with formula (4) hereinabove and then taking
the dot product to arrive at a normalization factor. A score for this
document against any other document is divided by this normalization
factor, yielding a maximum score of 1.0.
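
The weighting and normalization can be illustrated with the sketch below. Formula (4) is not reproduced in this passage, so the sketch assumes a generic weight of tf * log(1 + N/df); this is an assumption, not the formula of the embodiment. With a plain dot product the normalized value is not guaranteed to stay at or below 1.0 for arbitrary vector pairs, so the sketch clamps it. All function names are hypothetical.

```python
import math


def tfidf_weights(counts, doc_freq, n_docs):
    """Assumed weighting tf * log(1 + N/df); formula (4) may differ."""
    return {
        term: tf * math.log(1.0 + n_docs / doc_freq[term])
        for term, tf in counts.items()
    }


def dot(w1, w2):
    """Dot product of two sparse weight vectors stored as dictionaries."""
    return sum(weight * w2[term] for term, weight in w1.items() if term in w2)


def normalization_factor(weights):
    """Score of a document against itself."""
    return dot(weights, weights)


def normalized_score(w1, w2):
    """Score of document 1 against document 2, divided by document 1's factor."""
    norm = normalization_factor(w1)
    if not norm:
        return 0.0
    return min(dot(w1, w2) / norm, 1.0)  # clamp: an assumption of this sketch
```

A document scored against itself yields exactly the normalization factor in the numerator, so its normalized score is 1.0.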

After step 330, a make-twfmt step 340 creates an SFC input
file 68 for processing by a Subject Field Coder ("SFC"). A Subject Field
Coder (SFC) tags content-bearing words in a text with a disambiguated
subject code using a lexical resource of words whose senses are grouped by
subject categories.

A subject field code indicates the conceptual-level sense or
meaning of a word or phrase. The present invention, however, is not
limited to a specific hierarchical arrangement or a certain number or
scheme of subject field codes.

Each information bearing word in a text is looked up in a
lexical resource. If the word is in the lexicon, it is assigned a single,
unambiguous subject code using, if necessary, a process of disambiguation.
Once each content-bearing word in a text has been assigned a single SFC,
the frequencies of the codes for all words in the document are combined to
produce a fixed length, subject-based vector representation of the
document contents. This relatively high-level, conceptual representation
of documents and queries is a useful representation of texts used for
later matching and ranking.
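
As a hedged illustration of the fixed length vector, the sketch below counts subject field codes over a hypothetical three-code inventory and a hypothetical lexicon in which every word is already disambiguated. The code names, the lexicon entries and the use of raw counts (rather than some weighted frequency) are all assumptions.

```python
from collections import Counter

# Hypothetical fixed inventory of subject field codes.
SFC_CODES = ["MECH", "ELEC", "CHEM"]

# Hypothetical lexicon mapping each word to one (already disambiguated) code.
LEXICON = {"gear": "MECH", "shaft": "MECH", "circuit": "ELEC", "acid": "CHEM"}


def sfc_vector(words):
    """Count the codes of all words found in the lexicon, in a fixed order."""
    counts = Counter(LEXICON[word] for word in words if word in LEXICON)
    return [counts[code] for code in SFC_CODES]
```

Because the vector has one slot per code, every document maps to a vector of the same length regardless of its vocabulary.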

Polysemy (the ability of a word to have multiple meanings) is
a significant problem in information retrieval. Since words in the
English language have, on average, about 1.49 senses, with the most
commonly occurring nouns having an average of 7.3 senses, and the most
commonly occurring verbs having an average of 12.4 senses, a process of
disambiguation is involved in assigning a single subject field code to a
word.

Words with multiple meanings (and hence multiple possible
subject field code assignments) are disambiguated to a single subject
field code using three evidence sources (this method of disambiguation has
general application in other text processing modules to help improve
performance):

Local Context: If a word in the sentence has been tagged
with only one concept group code, this concept group code is considered
Unique. Further, if there are any concept group codes which have been
assigned to more than a predetermined number of words within the sentence
being processed, these concept group codes are considered Frequent codes.
These two types of locally determined concept group codes are used as
"anchors" in the sentence for disambiguating the remaining words. If any
of the ambiguous (polysemous) words in the sentence have either a Unique
or Frequent concept group code amongst their codes, that concept group
code is selected and that word is thereby disambiguated.

Domain Knowledge: Domain Knowledge representations reflect
the extent to which words of one concept group tend to co-occur with words
of the other concept groups (hence the notion of the domain predicting the
sense). For example, within a given sentence, a word with multiple
concept categories is disambiguated to the single concept category that
is most highly correlated with the Unique or Frequent concept category.
If several Unique or Frequent anchor words exist, the ambiguous word is
disambiguated to the correct category of the anchor word with the highest
overall correlation coefficient.

Global Knowledge: Global Knowledge simulates the
observation made in human sense disambiguation that more frequently used
senses of words are cognitively activated in preference to less frequently
used senses of words. Therefore, the words not yet disambiguated by Local
Context or Domain Knowledge will now have their multiple concept group
codes compared to a Global Knowledge database source.
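
The three evidence sources might be combined as in the following sketch, which is offered under stated assumptions only. The data structures (a list of word and candidate-code pairs, a dictionary of pairwise correlations, a dictionary of global sense frequencies), the function name disambiguate, and the tie-breaking behavior are hypothetical.

```python
from collections import Counter


def disambiguate(sentence_codes, correlation, global_freq, frequent_threshold=2):
    """sentence_codes: list of (word, [candidate codes]) for one sentence."""
    # Local Context: collect the Unique and Frequent anchor codes.
    unique = {codes[0] for _, codes in sentence_codes if len(codes) == 1}
    tally = Counter(c for _, codes in sentence_codes for c in set(codes))
    frequent = {c for c, n in tally.items() if n > frequent_threshold}
    anchors = unique | frequent

    result = {}
    for word, codes in sentence_codes:
        if len(codes) == 1:
            result[word] = codes[0]
            continue
        # Local Context: pick a candidate that is itself an anchor code.
        local = [c for c in codes if c in anchors]
        if local:
            result[word] = local[0]
            continue
        # Domain Knowledge: pick the candidate best correlated with an anchor.
        best = max(
            ((correlation.get((c, a), 0.0), c) for c in codes for a in anchors),
            default=(0.0, None),
        )
        if best[0] > 0.0:
            result[word] = best[1]
            continue
        # Global Knowledge: fall back to the most frequently used sense.
        result[word] = max(codes, key=lambda c: global_freq.get(c, 0))
    return result
```

In this sketch each source is consulted only when the previous one fails to settle the word, mirroring the order in which the sources are described above.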

Subject field codes are further discussed in Liddy, E.D.,
Paik, W., Yu, E.S. & McVearry, K., "An overview of DR-LINK and its
approach to document filtering," Proceedings of the ARPA Workshop on
Human Language Technology (1993).

Processing in step 340 concatenates all claim files together
(e.g., single file 54 or merged file 56, etc.) and adds several Standard
Generalized Mark-up Language ("SGML") tags as are well known in the art.
(Such processing is described in greater detail by Dr. Liddy, et al., in
"Categorizing And Standardizing Proper Nouns For Efficient Information
Retrieval").

Note that since documents are represented by SFCs, which are
language independent, a related embodiment can perform multi-language word
vector analysis on sets of documents. Thus, a related embodiment could,
for example, analyze a set of French patents.

Mapit-sfc step 350 next performs subject field coding on the
SFC input file 68 produced in step 340. Processing in mapit-sfc step 350 is
detailed in flowchart 361 of Fig. 4B. Referring to Fig. 4B, the first
step of such processing is dpfilter step 360 which removes unwanted SGML
delimited text. Following step 360, sfc-tagger step 370 uses a part of
speech tagger to parse all documents one sentence at a time. Sfc step 380
identifies subject field codes and the weighting for each document.
Finally, step 390 creates a mapit.sfc.weights file 70 for all documents
containing the associated subject field codes and weights. Processing
will now continue with step 120 of flowchart 100 as depicted on Fig. 2B.

Mapit-all step 120 creates a scores file 74 from a weights
file 74. This file has one integer weight from 0 to 99 for every pair of
documents in the input document dataset. For example, given documents D1
and D2, with corresponding weight vectors w1 and w2 held in a weights file
(such as the word vector weights file 72, or the SFC weights file 70),
corresponding normalization constants n1 and n2 held in a file (created in
step 330), and combination function f(w1,w2) defined hereinbelow, mapit-all
determines the maximum of a normalized similarity of weight vector w1 with
respect to weight vector w2 and a normalized similarity of weight vector w2
with respect to weight vector w1.
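
For illustration, the pair score can be sketched as the larger of the two normalized similarities, mapped to an integer from 0 to 99. The combination function is defined elsewhere in the specification; this sketch substitutes a plain dot product as a placeholder, and the clamping and the scaling by 100 with a cap of 99 are assumptions. The name pair_score is hypothetical.

```python
def dot(w1, w2):
    """Placeholder combination function: sparse dot product (an assumption)."""
    return sum(weight * w2[term] for term, weight in w1.items() if term in w2)


def pair_score(w1, w2, n1, n2, f=dot):
    """Maximum of the two normalized similarities as an integer 0..99."""
    s12 = f(w1, w2) / n1 if n1 else 0.0  # similarity of w1 with respect to w2
    s21 = f(w2, w1) / n2 if n2 else 0.0  # similarity of w2 with respect to w1
    best = min(max(s12, s21), 1.0)       # clamp to 1.0 (assumed)
    return min(int(best * 100), 99)      # assumed mapping onto 0..99
```

Taking the maximum makes the score symmetric in the two documents even when the normalization constants n1 and n2 differ.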

In a related embodiment, a cross-comparison algorithm takes
the average of single versus merged claims and merged versus merged
claims. For example, to implement the legal concept of patent
infringement as applied to sets of patents or patent applications, in a
particular embodiment, a similarity matching algorithm treats the exemplar
part of a patent claim differently from the dependent parts of the claim.
Thus, a kind of "cross-comparison" matching is used, wherein the combined
scores for (1) patent A, claim X dependent and independent part(s) vs.
patent B, claim Y, independent part and (2) patent A, claim X dependent
and independent part(s) vs. patent B, claim Y, dependent and independent
part(s), generate an aggregate matching (or similarity) score for patent
A, claim X vs. patent B, claim Y.

In cross comparison processing, weights, from either word
vector analysis or SFC analysis, are compared from the single file, block
54 of Fig. 2A, and the merged file, block 56 of Fig. 2A. For example,
document 1 with weight vectors w1s in the single file and w1m in the
merged file is cross compared with document 2, having weight vectors w2s
in the single file and w2m in the merged file. The cross comparison score
is basically an average of two combination functions of single and merged
weights, computed according to formula (1):

f'(w1, w2) = (f(w1s, w2m) + f(w1m, w2m)) / 2.0        (1)
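
Formula (1) can be evaluated as in the short sketch below. The combination function f is passed in as a parameter; the sparse dot product used in the usage note is only a placeholder assumption, and the function names are hypothetical.

```python
def dot(w1, w2):
    """Placeholder for the combination function f (an assumption)."""
    return sum(weight * w2[term] for term, weight in w1.items() if term in w2)


def cross_compare(w1s, w1m, w2m, f=dot):
    """Formula (1): average of single-vs-merged and merged-vs-merged scores."""
    return (f(w1s, w2m) + f(w1m, w2m)) / 2.0
```

For instance, with w1s = {"gear": 1.0}, w1m = {"gear": 1.0, "shaft": 1.0} and w2m = {"gear": 1.0, "shaft": 2.0}, the placeholder f gives 1.0 and 3.0, so the cross comparison score is 2.0 before any normalization.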

Following step 120 of Fig. 2B, mapit-all-by-patent step 122
aggregates claim level scores to the patent level producing a file
containing these patent scores 76. In a preferable embodiment the score
for patent p1 versus patent p2 is the score of the top scoring pair of any
claim from p1 against any claim from p2. Mapit-all-by-patent implements a
"search patents by best claim" function in the preferable embodiment of the
invention. The other patent level search, "search patents by all claims",
is achieved by performing a regular query against the justclaims data set
(i.e., all justclaims files 50 of patents in the associated search set)
instead of the top scoring claim in the justclaims data set.
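
The "search patents by best claim" aggregation amounts to taking a maximum over claim pairs, as the following sketch suggests. The input layout (a dictionary of claim-pair scores and a claim-to-patent mapping) and the name patent_scores are hypothetical.

```python
def patent_scores(claim_scores, claim_to_patent):
    """Best claim-pair score for each unordered pair of distinct patents."""
    best = {}
    for (claim_a, claim_b), score in claim_scores.items():
        pa, pb = claim_to_patent[claim_a], claim_to_patent[claim_b]
        if pa == pb:
            continue  # claims of the same patent do not form a patent pair
        key = tuple(sorted((pa, pb)))
        best[key] = max(best.get(key, 0), score)
    return best
```

Here the patent pair inherits the single highest score found among all of its claim pairs.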

Referring again to Fig. 2B, mapit-top-scores step 126 writes
the top N scores to an ASCII format file 82. The rationale underlying this
step is that large data file search time is expensive in terms of
computing resources. Therefore, in a preferable embodiment, the system
precomputes a manageable size score which is the system's "best guess" at
what will be of interest to the user. In a preferable embodiment this is
implemented by performing a mapit-extract step (i.e., step 300 of
Fig. 4A), sorting the resulting file by score, determining the value of
the Nth (i.e., lowest) score, doing a restricted mapit-extract step only
down to that Nth score level, and sorting again.

Mapit-score-range step 128 takes as its input the file 82
created in step 126, and calculates the minimum and maximum scores for
both word vector analysis and SFC type scores. It then writes this
information to standard output (STDOUT) which has typically been
redirected to a convenient file 84.

Following step 128, viz2d step 130 produces a two dimensional
plot of top scoring claims, where a score indicates the relative
similarity between two claims. Scores are based on word vector analysis.
Simultaneously, claim information is aggregated to the patent level in
order to depict relationships between patents based upon the similarity of
their claims. Claim matches are aggregated together to provide a ranking
method (based on a voting-type technique, a technique well known to those
having ordinary skill in the art). For patents, this is useful in
producing "company A vs. company B" type displays.

In a preferable embodiment, after the top matching pair of
claims (i.e., the two claims having the most similarity) in the data set
is found, the system rounds the score down to the nearest multiple of 5.
Call this score x. Next, three regions are defined. The top region is
defined as the rounded score to the rounded score +5 (x to x+5). The
middle region is defined as the rounded score -5 to the rounded score -1
(x-5 to x-1). The bottom region is defined as the rounded score -15 to
the rounded score -6 (x-15 to x-6).

For each pair of patents p1 and p2, a comparison is drawn for
each claim from p1 against each claim from p2 and the following number of
points are added to p1 versus p2. Ten is added if the two claims score in
the top range. Five is added if the two claims score in the middle range.
One is added if the two claims score in the bottom range. Zero is added
if the score falls below the bottom range and it is not plotted. Claims
falling into each range may be distinguished on the two-dimensional plot
through any appropriate identifier such as color coding or symbols. For
example, the top, middle and bottom ranges may be plotted with points
having colors red, blue and gray, respectively.
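
The region boundaries and point values just described can be sketched as follows, assuming integer claim-pair scores. The function names region_points and vote and the input layout are hypothetical.

```python
from collections import Counter


def region_points(score, top_score):
    """Points for one claim pair, given the best score in the data set."""
    x = (top_score // 5) * 5       # round down to the nearest multiple of 5
    if x <= score <= x + 5:        # top region
        return 10
    if x - 5 <= score <= x - 1:    # middle region
        return 5
    if x - 15 <= score <= x - 6:   # bottom region
        return 1
    return 0                       # below the bottom region: not plotted


def vote(claim_scores, claim_to_patent):
    """Sum the points of all claim pairs for each pair of patents."""
    top = max(claim_scores.values())
    totals = Counter()
    for (claim_a, claim_b), score in claim_scores.items():
        key = tuple(sorted((claim_to_patent[claim_a], claim_to_patent[claim_b])))
        totals[key] += region_points(score, top)
    return totals
```

With a top score of 87, x is 85, so scores of 85 to 90 earn ten points, 80 to 84 earn five, 70 to 79 earn one, and anything below 70 earns none.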

All claims at or above the bottom range are plotted and the
top ten patent pairs, as scored by the method described hereinabove, are
labeled on the graph. The graph is written to a graphs/viz2d.* file 86
and the top ten patent pairs are also written to a separate
graphs/viz2d.*.topic file 88. In a preferable embodiment, step 130
employs the UNIX utility gnuplot to generate a postscript plot and then
uses the gs UNIX utility to convert the output of the prior step to a ppm
file, which is then converted to a gif file using ppm as is well known by
those having ordinary skill in the art. An example of such a plot is
provided in Fig. 8B.

Returning again to Fig. 2B, viz3d step 132 produces a three
dimensional plot of top scoring claims while simultaneously aggregating
claim information to the patent level. Its functioning is much the same
as that of viz2d step 130. However, it gives a 3-D projection of the
results and does not label the top ten matches on the graph. An example
of such a plot is provided in Fig. 8C.

Finally, viz-compare step 134 produces a cluster plot (also
referred to as a "scatter plot") of all the claim pairs from a data set.
In contrast to viz2d step 130 and viz3d step 132, wherein the x-axis is
one claim number, the y-axis is another claim number, and a dot is plotted
if that pair of documents scores above the bottom threshold, the method of
viz-compare is that the x-axis represents a wordvec score, the y-axis
represents an SFC score, and a dot is plotted if there exists a pair of
claims having the corresponding wordvec and SFC scores. An example of
such a plot is provided in Fig. 8A.

The scores plotted in Figs. 8A-8C are used to identify
documents most closely or proximally related; i.e., "proximity scores".
However, such scores may also be plotted to identify those documents that
are most different or distally related; i.e., "distal scores". An example
of the latter may be seen in Fig. 8D (discussed below). Such distal
scores may also be plotted in the charts of Figs. 8A-8C. As such, scores
plotted to show relationships among documents are more generally referred
to herein as utility measures.

In an alternative embodiment, a user of the system may select
which plot type(s) are desired by selectively engaging steps 130, 132
and/or 134 of Fig. 2B.

Having detailed the off-line processing component, we now turn
to the on-line concept query processing aspect of the invention.

On Line Concept Query Processing 

In a concept query, as contrasted to a document query, the
user has entered an arbitrary text string (which may be user-originated or
copied from a portion or all of a document) which the system must match
against the body of known documents to be analyzed (e.g., the dataset).
Thus, many of the off-line processing steps described above must be
performed against the on-line entered string to get the text into a usable
format. Flowchart 401 of Fig. 5 depicts the online query processing.
Initially, a user's query input to the system is written to an ASCII
formatted file 82, pursuant to step 400.

Actual query processing is handled through a shell script,
pursuant to mapit-query step 410. Mapit-query step 410 performs the
following processing steps: 1) build a claimlist, 2) preprocess, 3) tf,
4) tfidf0, and 5) tfidf-all. These are identical in function to the
following steps in the off-line claims processing section described
hereinabove: 1) the "build claimlist" function of make-claims step 106 in
Fig. 2A (the system builds a claimlist data structure 84 from the user's
query stored as an ASCII format file 82, pursuant to step 400; the
resulting structure is the same in format as claimlist data structure 12
of Fig. 2A), 2) preprocess step 108 in Fig. 2A, 3) tf step 300 in Fig. 4A,
4) tfidf0 step 310 in Fig. 4A, and 5) tfidf-all step 320 in Fig. 4A. The
output of mapit-query is a set of scores from analysis of the user's
query, which are written to a query weight file 86.

Following step 410, mapit-process-query step 420 builds a full
score file 90 from input query weight file 86, containing the weights of
word stems in the user's query, and a document weights file 88, produced
during the off-line processing of the document database as described
hereinabove, containing the weights of word stems in the document
database. The full score file possesses one integer weight 0-99 for every
document in a body or set of documents being processed.

Mapit-all step 120 creates a scores file 74 from a weights
file 74. This file has one integer weight from 0 to 99 for every pair of
documents in the input document dataset. For example, given documents D1
and D2, with corresponding weight vectors w1 and w2 held in a weights file
(such as the word vector weights file 72, or the SFC weights file 70),
corresponding normalization constants n1 and n2 held in a file (created in
step 330), and combination function f(w1,w2) defined hereinbelow, mapit-all
determines the maximum of a normalized similarity of weight vector w1 with
respect to weight vector w2 and a normalized similarity of weight vector w2
with respect to weight vector w1.

Finally, in step 430 the results are converted into a "stars"
representation. One star is given for any document with a score greater
than zero. An additional star is given for every twenty points in a
document's score. The stars are displayed to the user as a representation
of the score.
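
The star conversion can be expressed compactly. The following is a minimal
sketch, assuming scores are the integers 0 to 99 described above; the
function name stars is hypothetical.

```python
def stars(score: int) -> int:
    """Convert a 0-99 score into a star count.

    One star for any score above zero, plus one more star for every
    full twenty points in the score.
    """
    if score <= 0:
        return 0
    return 1 + score // 20


def render(score: int) -> str:
    """Text form of the star rating, e.g. for a results list."""
    return "*" * stars(score)
```

Under this reading a score in the range 80 to 99 yields five stars, which
agrees with the one-to-five star relevance levels shown on the results
screens.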

In applications where response time is critical and/or a
large set of documents requires searching (e.g., based on weights and
scores), well-known enhancements may be added to the system to increase
processing speed, such as use of an index access method or other
techniques to optimize fast storage and retrieval of data, as are well
known to persons of ordinary skill in the art.

In a further embodiment, documents are processed according to
the off-line processing method described hereinabove to the point where
plots are generated in accordance with steps 130-134 of Fig. 2B.

Off-line Processing of Non-structured (Generic) Documents

Flow chart 501 of Figs. 6A and 6B describes off-line processing
of non-structured or generic documents (e.g., technical publications,
non-structured portions of structured documents such as the abstract and
detailed description of a patent, etc.). For the purposes of this
discussion, Figs. 6A and 6B are compared to Figs. 2A and 2B to highlight
the differences between off-line processing of structured documents and
off-line processing of generic documents. Off-line generic document
processing begins with creating a file containing the text of the entire
document, pursuant to step 502. Input to this step is a text formatted
file 11 containing documents in the subject search set. Output is a text
file 51. Following step 502, a file 53 is created pursuant to step 504,
which contains the full directory path name for each document in file 51.
Comparing off-line generic document processing with off-line structured
document processing indicates that there is no analog to the make-claims
step 106 in generic document processing. Furthermore, the single file 54
and merged file 56 outputs of the structured document processing
make-claims step do not exist in the generic document processing.

Processing continues with preprocess step 508. Preprocess
step 508 is virtually identical to preprocess step 108 (Fig. 2A) of
off-line structured claim processing. Preprocessing is described in detail
in Fig. 3 as well as hereinabove. Processing continues with build-hash
step 512 (build-claimlist step 110 is omitted from flow chart 501), which
creates hashed files 59. These files are a subset of the files 60 created
in structured document processing and include: 1) a mapping from a claim
number to a unique document index, representative of each document being
analyzed; 2) a mapping from a claim number to the full directory path to
the text of that claim; and 3) a mapping from a claim number to the first
150 characters of the claim. The fix-patentlist step 114 of structured
document processing (Fig. 2A) is omitted in the generic document
processing of Fig. 6A. The generic processing continues with
mapit-process-generic step 516. The mapit-process-generic step is
virtually identical to the mapit-process step 116 of structured claim
processing. Mapit-process is described in detail in Figs. 4A and 4B and
hereinabove. The output of mapit-process-generic step 516 includes an SFC
input file 61 and a mapit.sfc.weights file 63. These files are identical
to files 60 and 62, respectively, of Fig. 2A. Off-line generic processing
continues on Fig. 6B with mapit-all step 520, which builds a scores file
75 from a weights file 63. Since there are no structured elements such as
claims in generic documents, there is no equivalent to the mapit-all by
patent step 122. Generic document processing therefore continues with
retrofit-sfc step 520, which functions like its counterpart retrofit-SFC
step 124 in Fig. 2B. Retrofit-SFC step 520 applies word vector analysis
information to the SFC weighted scores, producing a new SFC score file 81
and saving the original information in an original file 79. The
processing continues with mapit-top-scores step 526, which creates a file
83 of top scores. Finally, mapit-score-range step 528 computes the minimum
and maximum scores and writes them into file 85. This information may be
output as an individual data file using conventional means.

"Generic" documents in this context may include claims treated 
as a generic document (i.e., without parsing) compared with other portions 
of a patent (e.g., summary, abstract, detailed description, etc.). 

In an alternative embodiment, it is contemplated that plot
generation including two dimensional, three dimensional and cluster will
be available with generic document processing. This feature will be
enabled in accordance with the methodology discussed above for structured
document processing.



Claim Parsing According to a Specific Embodiment

Fig. 7A shows a flow chart 650 with a method of parsing claims
according to a specific embodiment of the invention. In a preferred
embodiment, the method of flow chart 650 is employed by step 106 of Fig.
2A to create files 54 and 56. The input to the claims parsing process is
a single file containing a set of all claims from a patent (e.g.,
justclaims 50). The output is a single file and a merged file for each
claim. The single file will contain only the body of a single claim. The
merged file will contain the body of the single claim in addition to the
transitive closure of all claims referenced therein. These files are
identical to files 54 and 56, respectively, of Fig. 2A.
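
The relationship between the single file and the merged file can be
illustrated with a short sketch. It assumes the claim bodies and the direct
references of each claim are already available as dictionaries; the names
transitive_closure, merged_text, bodies and refs are hypothetical and are
not taken from the described system.

```python
def transitive_closure(claim, refs):
    """Claims referenced directly or indirectly by `claim`.

    `refs` maps a claim number to the list of claim numbers it cites
    directly. The result lists each claim once, in the order in which
    it is first reached.
    """
    ordered = []

    def visit(number):
        for cited in refs.get(number, []):
            if cited != claim and cited not in ordered:
                ordered.append(cited)
                visit(cited)

    visit(claim)
    return ordered


def merged_text(claim, bodies, refs):
    """Body of `claim` followed by the bodies of every claim in its closure."""
    parts = [bodies[claim]]
    parts.extend(bodies[n] for n in transitive_closure(claim, refs))
    return "\n".join(parts)
```

For an independent claim the closure is empty, so the merged text is the
same as the single text.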

The process reads claims from the justclaims input file 50,
one line at a time, in the "get the next line" step 600. The system then
determines in step 602 whether the line read in step 600 is the start of a
new claim. New claims are indicated to the system by a fresh line starting
with a claim number followed by a period, a space and the claim text.
Claim numbers must be sequential and begin with the number 1. If the
system detects the beginning of a new claim, then the system will add the
current claim that it had been processing to the claim list file 12 (list
of document names) in step 604. Otherwise, or in any event, in step 606
the system appends the current line read in from the file to the current
or new claim body. Next, the system will determine whether another claim
is referenced in the current line read in from the file, pursuant to step
608. If a reference is indicated, the system will read in the next line
from the input file in step 610. This is done in case the reference
crosses a line boundary. The system will also try to identify claim
references in step 610. Note that there are two simplifying assumptions.
First, claim references never run more than two lines. Second, a new claim
reference is never detected on the second line which continues to the
third line.
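
The new-claim test of step 602 can be illustrated as follows. This is a
simplified sketch rather than the flow of Fig. 7A itself: it collects claim
bodies into a dictionary instead of writing to claim list file 12, and the
names CLAIM_START and split_claims are hypothetical. It relies on the two
stated conventions, namely that a claim starts on a fresh line with a
number, a period and a space, and that claim numbers are sequential
starting from 1.

```python
import re

# A fresh line beginning with "<number>. " followed by the claim text.
CLAIM_START = re.compile(r"^(\d+)\. (.*)$")


def split_claims(lines):
    """Group input lines into claim bodies keyed by claim number."""
    claims = {}
    current = None
    for line in lines:
        match = CLAIM_START.match(line)
        expected = len(claims) + 1
        if match and int(match.group(1)) == expected:
            # Start of a new claim: the number is the next one in sequence.
            current = expected
            claims[current] = match.group(2).strip()
        elif current is not None:
            # Continuation line: append to the claim being built.
            claims[current] += " " + line.strip()
    return claims
```

Requiring the number to be the next in sequence keeps a numbered list item
inside a claim body from being mistaken for the start of a new claim.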

In the alternative, or in any event, the system tokenizes the
line, saving the tokens into an array pursuant to step 612. All matter in
the line up to the word "claim" is discarded. For example, in the line "5.
The method of claim 1", this step would eliminate all text prior to the
word "claim", i.e., "5. The method of". Tokens are not split based upon
punctuation because doing so creates extra tokens. Ending, or trailing,
punctuation is removed from the end of words in step 614. The last word
in the line is saved in a variable "last_word" in step 616 to facilitate
the check for the words "preceding" or "previous" in step 622 of Fig. 7B.

Having tokenized the line into words, the system will now
invoke process words in step 618, as described below, to look for
references to other claims within the line. Upon completing step 618, a
determination is made in step 619 as to whether there are any more lines
in the input file (i.e., justclaims 50). If yes, control returns to step
600 to process the next line in the current or a new claim. If not,
parsing is complete for the set of claims of the subject patent and
parsing processing stops (unless another patent is to be processed).

Referring to flow chart 652 in Fig. 7B, words in the array are
processed serially beginning with the "get next word" step 620, which
fetches a current word. The system checks for the existence of the word
"previous" or "preceding" in step 622. If the "last_word" was "previous"
or "preceding", then the system understands this to indicate that it
should add all claims including this one to claim list file 12 in step
624. In the alternative, processing proceeds with the system checking for
a plain (i.e., arabic) number in step 626. If the system detects a plain
number, then the system understands this to indicate that a new claim has
been found, and that the current claim should be added to claim list file
12, pursuant to step 628. In the alternative, the system next checks the
current word for an "or", an "and" or an "inclusive" in step 630. If the
system detects the presence of any of these three words, the word is
skipped and no processing is done, pursuant to step 632. In the
alternative, processing proceeds with examining the current word for a
hyphenated range pursuant to step 634 (for example, claims 4-19). If the
system detects the presence of a hyphenated range, the system adds the
claims in the range to claim list file 12, in accordance with step 636.
In the alternative, processing proceeds to check for the existence of a
range delimited by the words "to" or "through" in step 638. If the system
detects a "to" or a "through" delimited range, the system adds the claims
in the range to claim list file 12, pursuant to step 640. In the
alternative, the system detects the condition that there is nothing more
to reference. At this point, the system has detected that this is the end
of the claim reference. Processing continues with the system searching
for another claim reference within the subject line, pursuant to step 642.
Next, in step 644, the current word is saved in the "last_word" variable.
The system next determines whether there are any more words in the subject
line being processed, pursuant to step 645. If not, control returns to
step 619 of flow chart 650. Otherwise, in preparation for another
iteration through the loop, control flows back to the beginning of the
process-words step, where the "get next word" step 620 is executed to
process the next word in the set of words.
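
The word-by-word checks above can be approximated in a few lines. The
sketch below is a simplified illustration, not the flow of Fig. 7B: it
handles a single line, treats the word immediately before "claim" as the
last word checked for "previous" or "preceding", and returns claim numbers
rather than writing to claim list file 12. The function name
referenced_claims and its parameters are hypothetical.

```python
import re

SKIP_WORDS = {"or", "and", "inclusive"}
RANGE_WORDS = {"to", "through"}


def referenced_claims(line, current):
    """Claim numbers referenced in `line`, which belongs to claim `current`."""
    match = re.search(r"\bclaims?\b", line, flags=re.IGNORECASE)
    if not match:
        return []
    # "any preceding claim" / "any previous claim": every earlier claim.
    before = line[:match.start()].split()
    if before and before[-1].lower() in ("previous", "preceding"):
        return list(range(1, current))
    # Discard everything up to "claim"; strip trailing punctuation only.
    words = [w.rstrip(".,;:") for w in line[match.end():].split()]
    refs = []
    i = 0
    while i < len(words):
        word = words[i].lower()
        if word.isdigit():
            is_range = (i + 2 < len(words)
                        and words[i + 1].lower() in RANGE_WORDS
                        and words[i + 2].isdigit())
            if is_range:  # "2 to 4" or "2 through 4"
                refs.extend(range(int(word), int(words[i + 2]) + 1))
                i += 2
            else:  # plain number
                refs.append(int(word))
        elif re.fullmatch(r"\d+-\d+", word):  # hyphenated range, e.g. "4-19"
            low, high = map(int, word.split("-"))
            refs.extend(range(low, high + 1))
        elif word not in SKIP_WORDS:
            break  # nothing more to reference on this line
        i += 1
    return sorted(set(refs))
```

For instance, the line "5. The method of claim 1" in claim 5 yields the
single reference to claim 1.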

Ultimately, when all of the words in a line are reached, 
control flows to step 619 in Fig. 7A, which detects if the last line of 
the claim has been processed. If so, processing halts for this claim. 
Otherwise, control returns to the get-next-line step 600. 

Graphical Display and Visualization of Analysis Results 

Figs. 8A-8D illustrate examples of formats in which to display
and analyze document data as provided by a particular embodiment of the
invention.

Typical clustering techniques, known in the art, represent
documents as points in an n-dimensional display, wherein each point
corresponds to a single document and each dimension corresponds to a
document attribute. These clusters are then typically displayed as
graphical images where related documents are indicated by spatial
proximity (sometimes further distinguished by color or shape). Examples
of this sort of clustering include the "Themescape" type displays from
Battelle, a corporation with headquarters in Columbus, Ohio.

In contrast, according to the invention, clustering is
performed using a single point in n-dimensional space to represent a pair
of documents, rather than a single document. Each dimension represents a
separate metric measuring the similarity of the two documents. By using
different sets of orthogonal metrics, clustering of underlying documents
can be performed in different ways to highlight different features of the
overall collection.

A set of metrics can be selected for display. For example,
Fig. 8A depicts two orthogonal similarity metrics: thematic similarity 702
(in the form of a semantic thread score or SFC-type score), which
identifies documents about the same topic even if they use different
terminology, and syntactic similarity 704 (in the form of a word vector
score), which identifies documents that use the same terms and phrases.
These metrics may employ differing matching techniques. For example, a
subject field code (SFC) vector technique may be combined with a space
metric based on TF.IDF weighted term occurrences.

Preferably, thematic similarity is determined employing SFC
techniques described by Dr. Elizabeth Liddy in the above-referenced
articles. Further, syntactic similarity is determined through word-vector
analysis using TF.IDF techniques, which are well known to those having
ordinary skill in the art and are further described in the Salton
reference. In a preferred embodiment this set of metrics is displayed
visually as an x-y scatter plot, as in Fig. 8A, although clusters can be
displayed within larger dimension sets by using additional graphical
attributes such as 3D position, size, shape, and color.

Many systems use a combination algorithm to collapse multiple
similarity measures into a single value. According to the invention, the
individual similarity components in the visual display are retained,
allowing the user to interpret the multiple dimensions directly. For
example, for certain patent applications, it may be useful to identify
document pairs that are similar across both dimensions, while for other
applications it may be more important to identify cases where the two
similarity scores differ. The user can interactively explore the
visualization by using a mouse or other input device to indicate either a
single point (a single pair of documents) or regions of points (a cluster
of document pairs). The documents represented by these points can then be
displayed, either by presenting full text or by presenting identifying
attributes such as title and author. The ability to cluster and display
documents using multiple similarity measures simultaneously would be lost
if everything were collapsed to a single score.

Scatter Diagram: 

Fig. 8A illustrates a scatter plot for drawing inferences from
A vs. B types of analyses according to the method described above in the
viz-compare step 134 of Fig. 2B. A collection of documents may be split
into two sets, for example, patents from Company A and patents from
Company B. Paired proximity scores are developed, using the method
described hereinabove, one score for every document in set A against every
document in set B, and the other score for every document in set B against
every document in set A.

In the scatter plot, the x-axis represents relative similarity
according to a syntactic or word vector based score. The y-axis depicts
relative similarity based on a conceptual or semantic thread based score.
In a split dataset, each document from the first dataset is compared
against the documents of the other dataset, resulting in a score
represented by a point in the space defined by the syntactic and semantic
axes. Documents which are highly similar according to word vector based
analysis will appear farthest to the right on the plot. Documents having
the highest similarity according to a semantic based analysis will appear
at the top of the plot. Documents having the greatest similarity to one
another based upon both word vector and semantic thread score will appear
in the upper right hand corner of the plot. Documents having the least
amount of similarity according to both word vector and semantic scores
will appear in the lower left hand corner of the plot.

In a related embodiment, the highest proximity scores for each
document in set A against entire set B, and the highest proximity scores
for each document in set B against entire set A, are determined.

In a related embodiment, zooming-in or zooming-out in a 
scatter plot increases or decreases the resolution and range/domain of the 
plot. 

2-D Diagram: 

Fig. 8B illustrates a 2D visualization of an analysis
conducted on two sets of patents according to the method described in the
viz2d step 130 of Fig. 2B. In the 2-D plot, the x-axis exhibits the
patents in the dataset as a monotonically increasing sequence of patent
numbers. The y-axis is identical to the x-axis. Clusters of the most
similar patents within the dataset are plotted on the graph. Clusters
with scores falling within the 95 to 100 range are plotted with a square.
Clusters with a score falling within the 90 to 94 range are plotted with a
cross. Clusters with a score falling within the 80 to 89 range are
plotted with a circle.

In a related embodiment, color is added to the 2D, orthogonal
similarity plot according to various criteria. For example, if the user
types in a search concept "digital image segmentation and edge detection,"
patent components shown in the plot will change color (or some other
display appearance attribute) according to the strength of presence of
this concept in the data. This may be carried out with an overlay plot
applied to the 2-D diagram.
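
The score bands used to choose a plotting symbol in the 2-D diagram can be
written as a small lookup. This is a minimal sketch; the function name
marker_for is hypothetical, and returning None for scores below 80 reflects
the assumption that such clusters are simply not plotted.

```python
def marker_for(score):
    """Plotting symbol for a cluster score, or None if it is not plotted."""
    if 95 <= score <= 100:
        return "square"
    if 90 <= score <= 94:
        return "cross"
    if 80 <= score <= 89:
        return "circle"
    return None
```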

3-D Diagram:

Fig. 8C illustrates a 3D visualization of an analysis
conducted on two sets of patents according to the method described in the
viz3d step 132 of Fig. 2B. The 3-D diagram depicts the same information
as the 2-D diagram, only in a three dimensional format. The x-axis and
y-axis are both delineated by monotonically increasing numbers of the
patents in the dataset. The z-axis represents a ranged degree of
similarity of the patents. Scores based on the similarity of clusters of
patents are plotted in the 3-D framework with the same graphical
representations as in the 2-D plot described hereinabove (i.e., scores
within the 95 to 100 range are depicted as a square; scores within the 90
to 94 range are depicted with a cross; scores falling within the 80 to 89
range are depicted by a circle).

S-Curve Diagram:

Fig. 8D illustrates an S-curve plot for drawing inferences
from A vs. B types of analyses. In this method of displaying data
analysis results, documents from dataset A are plotted on the left hand
side, with low proximity scores having negative values with large absolute
values, and documents from dataset B are plotted on the right hand side,
with low proximity scores having positive values with large absolute
values. In other words, compute (score - 1.0) for set A documents and
(1.0 - score) for set B documents, then sort and plot to yield an S-shaped
curve.

Fig. 8E illustrates the steps to produce the S-curve. The
process depicted in the flow chart 801 begins with the generation of all
scores, either term or concept, from a claim level data set A versus data
set B analysis 850, for example, the patents from Company A compared with
the patents from Company B on a claim by claim basis. These scores are in
the range of 0.0 to 1.0. Next, in step 852, all claims are sequentially
numbered such that the first claim from Company A is 1 and the last claim
from Company B is n, and all claims from A precede all claims from B. In
step 854, for each claim index I from Company A, find the closest claim
from Company B and record the pair (I, S-1.0), where S is the similarity
score of A compared with B. Next, in step 856, for each claim index I
from Company B, find the closest claim from Company A and record the pair
(I, 1.0-S), where S is the similarity score of A compared to B. Finally,
in step 858, sort all pairs in increasing order of second coordinate and
display on a plot where the x-axis represents the claim index and the
y-axis represents the claim score.

The result is a plot in the form of an S-curve where the
bottom part of the S represents claims unique to Company A, the middle
part represents claims with possible overlaps between the two companies,
and the top part represents claims unique to Company B.
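
The steps of Fig. 8E can be sketched in code. The example below assumes
the claim-by-claim scores are supplied as a matrix in which scores[i][j] is
the similarity, in the range 0.0 to 1.0, between claim i of Company A and
claim j of Company B; that layout and the function name s_curve are
illustrative assumptions rather than part of the described system.

```python
def s_curve(scores):
    """Return (claim_index, plotted_value) pairs sorted by plotted value.

    Claims from set A are numbered 1..len(A); claims from set B follow.
    """
    n_a = len(scores)
    n_b = len(scores[0]) if scores else 0
    pairs = []
    # Step 854: best match in B for each claim of A, recorded as S - 1.0.
    for i in range(n_a):
        best = max(scores[i][j] for j in range(n_b))
        pairs.append((i + 1, best - 1.0))
    # Step 856: best match in A for each claim of B, recorded as 1.0 - S.
    for j in range(n_b):
        best = max(scores[i][j] for i in range(n_a))
        pairs.append((n_a + j + 1, 1.0 - best))
    # Step 858: sort in increasing order of the second coordinate.
    pairs.sort(key=lambda pair: pair[1])
    return pairs
```

Claims with close counterparts in the other set land near zero, in the
middle of the curve, while claims with no close counterpart fall toward
-1.0 (set A) or +1.0 (set B).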

In a related embodiment, the S-curve method of displaying data
is extended to analyses wherein additional documents are added to sets A
and/or B and reanalyzed. The resulting graph is overlaid on top of the
original graph. This permits the user to track changes over time, for
example where changes in the shape of an S-curve of patent portfolios
represent changes in the technology holdings of one company relative to
the other.

Techniques for Analysis of Documents

Screens (also referred to as "pages" herein) and automated
tools incorporated in a specific embodiment of the invention enable a
user to perform detailed study of analysis results of the system. Fig.
9A, for example, depicts a representative sign-on screen for a user
according to the invention. Screens are produced using the NetScape
NetBrowser interface to the worldwide web. The reader of ordinary skill
in the art will appreciate that other web browsers and other programs may
be used as a user interface to the patent analysis aspect of this
invention. The user enters a user I.D. and password in the screen
depicted by Fig. 9A to sign on to the system described herein. After the
password and I.D. have been authenticated, in one embodiment of the
present invention, a dataset representing a portion of the U.S. patent
database (e.g., over 2 million patents) is automatically selected. In
another embodiment, it is necessary to choose an initial dataset to
analyze. Exemplary dataset types include: Portfolio Analytics, Custom
Canvas, Products, World Patents and Industry Verticals.

Portfolio Analytics contains patent datasets (i.e., sets of
patents). There are two types of patent sets: single and split. Single
patent sets contain all patents together in one group. All search and
analysis functions are applied to all of the patents and claims in the
patent set. In contrast, split sets contain two groups of patents. These
two independent patent groups are measured against each other during
comparative analysis. For example, if a split set contains information
about company A in one patent group and company B in another group, then a
claim query or patent query with a patent from the company A group will
display the company A item versus a company B item. An exemplary screen
shot of dataset selection is provided in Fig. 9B.

The remaining exemplary datasets include Custom Canvas (which
will contain user-defined sets), Products (which will contain product
datasets for patent versus product analysis), World Patents (which will
contain patent sets grouped by geographical region) and Industry Verticals
(which will contain industry-specific patent sets).

Figs. 9C - 9K depict representative screens in accordance with
the performing of a concept query as described hereinabove. A concept
query entry screen 900 depicted in Fig. 9C enables the user to enter, in
English text, a description of a concept which the system will search for
in the database of patents. The concept entry screen has fields which
enable the user to specify a job I.D. for billing purposes, to search
sections by abstracts or claims, and to control the order of sorting.
Further, screen 900 provides a NASA Thesaurus link 902 which, when clicked
upon, launches a Netscape window with the index of the NASA Thesaurus. A
term found in the thesaurus may be included in the query by copying and
pasting or simply typing the word into query box 904.

Screen 900 also includes a search selection box 906 which is
used to define the scope of a query and results. The options for box 906
include "claims," "patents (best claim)" and "patents (all claims)." In
the "claims" option, the system searches each individual claim in the
selected dataset and returns a results list ranked by claim score. The
results list, as shown in the screen of Fig. 9E, displays patent
information 916, 923 as well as claim information 918, including a preview
of the claim text 920.

In the "patents (best claim)" option, the system searches each
individual claim in the selected dataset and returns a results list ranked
by patent, where the patent score is based on the score of the highest
ranked claim in the patent. The results list displays patent information.

In the "patents (all claims)" option, the system searches the
combined (i.e., all) claims for each patent and returns a results list
ranked by patent, where the patent score is based on a score for all the
claims in the patent. The results list, as shown in Fig. 9F, displays
patent information 926, 928.
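
The "patents (best claim)" ranking can be illustrated with a short sketch.
It assumes per-claim scores are already available as (patent number, claim
number, score) tuples; that input shape and the function name
rank_patents_by_best_claim are hypothetical.

```python
def rank_patents_by_best_claim(claim_scores):
    """Rank patents by the score of their highest-scoring claim.

    `claim_scores` is an iterable of (patent, claim, score) tuples.
    Returns (patent, best_claim, best_score) tuples, highest score first.
    """
    best = {}
    for patent, claim, score in claim_scores:
        if patent not in best or score > best[patent][1]:
            best[patent] = (claim, score)
    ranked = [(patent, claim, score) for patent, (claim, score) in best.items()]
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked
```

The "patents (all claims)" option differs in that the combined claims of
each patent are scored as one text, so it cannot be derived from the
per-claim scores in this way.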

Referring again to Fig. 9C, clicking on Analyze Query button
903 produces a concept query review screen 909 of Fig. 9D, which depicts
the results of the stemming operations described hereinabove as applied to
the user's concept query which has been entered in screen 900 of Fig. 9C.
For each stemmed word and phrase entered in the concept query, the concept
query review screen indicates the number of claims 912 and patents 914
containing each word or phrase. By clicking a "show results" button 915
on screen 909, the user may go to a "concept query results" screen
depicted in Fig. 9E (for a "claims" search) or 9F (for a "patents (all
claims)" search).

Referring to Fig. 9E, a concept query results screen 917
provides the results of a user's "claims" search as applied to the
database of patents. For the representative query depicted in box 919 of
Fig. 9E, the results are provided in a list ranked by claim score. The
relevance level 921 of any given claim is indicated by the number of stars
from one (worst) to five (best). A user may click on a rank number 922 to
move to a screen showing a side-by-side comparison of the associated claim
and the original query (Fig. 9I). Additionally, a user may click on a
patent number 916 to move to a screen showing the full text of the patent
(Fig. 9K) and on a claim number 918 to move to a screen showing the full
text of the claim (Fig. 9J). These linked screens are described in more
detail below.

Screen 917 (like many screens described below) contains a
number of "links" to other screens in forms which include rank numbers,
patent numbers and claim numbers. These inter-screen links may be
achieved using HTML (e.g., via hyperlinks) or any other conventional
method known to those having ordinary skill in the art. Such links
provide a convenient and well-known mechanism to "navigate" between
screens containing information desired by a user. In a preferred
embodiment of the invention, clicking on a claim number, patent number or
rank number in any screen in which such numbers represent links will call
a "viewer" function, which loads the relevant text described above into a
separate window.

More specifically, clicking on a rank number 922 results in a
link to a viewer side-by-side comparison screen in the form of screen 970
of Fig. 9I. As shown therein, the left half of the screen 972 contains
the full text of a concept query while the right side of the screen 974
includes the title, assignee, patent number, and full text of the subject
claim. According to one embodiment, if a subject claim refers to a
previous claim (i.e., it is a dependent claim), all the claims referenced,
either directly or indirectly (i.e., the transitive closure of the subject
claim), will be shown in the order referenced. According to another
embodiment, if a subject claim refers to a first previous claim, the first
previous claim number will be in the form of a link embedded in the text
of the subject claim. This link will be to a screen containing the text
of the first previous claim. In like fashion, if the first previous claim
refers to a second previous claim, a second link (in the form of the
second previous claim number) will be embedded in the text of the first
previous claim to a screen containing the text of the second previous
claim. This daisy chain of links continues until the family of claims is
traced back to the associated independent claim(s).

Referring again to Fig. 9I, patent number 986 functions as a
link to a screen containing the full text of the subject patent. In
addition, highlighting controls 976-982 are provided in this screen. Such
controls allow a user to highlight text in any of the text areas
displayed using two colors. Words or phrases are inserted into boxes 976
and 980, and desired colors are chosen in boxes 978 and 982,
respectively. Upon clicking update button 984, the desired words and
phrases in all of the text windows will be highlighted using the colors
indicated in boxes 978, 982 for each text box 976, 980.
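
One way such two-color highlighting could be realized in an HTML-based
viewer is sketched below. This is an illustration only: the specification
does not state how the highlighting is implemented, and the function name
highlight, the case-insensitive matching and the use of span elements with
a background color are all assumptions.

```python
import html
import re


def highlight(text, colors):
    """Return HTML for `text` with each term wrapped in a colored span.

    `colors` maps a non-empty word or phrase to a color name, for example
    the contents of boxes 976/978 and 980/982.
    """
    lookup = {term.lower(): color for term, color in colors.items()}
    if not lookup:
        return html.escape(text)
    # Try longer phrases first so that they win over their own sub-words.
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    pieces = []
    position = 0
    for match in pattern.finditer(text):
        pieces.append(html.escape(text[position:match.start()]))
        color = lookup[match.group(0).lower()]
        pieces.append('<span style="background-color:%s">%s</span>'
                      % (color, html.escape(match.group(0))))
        position = match.end()
    pieces.append(html.escape(text[position:]))
    return "".join(pieces)
```

Building the output in a single pass avoids a second term matching inside
the markup that was inserted for the first term.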

Referring back to Fig. 9E, clicking on a claim number 918
links to a claim viewer screen in the form of screen 980 of Fig. 9J. As
shown therein, this screen is essentially the same as screen 970 (like
reference numbers refer to like features) without left-half portion 972.
Again, if a subject claim refers to a previous claim, in one embodiment
the transitive closure of the claim (i.e., all claims referenced either
directly or indirectly) shall be shown in the order referenced.
Alternatively, in another embodiment the subject claim shall include in
its text the claim number of the directly referenced claim(s) in the form
of a link to another screen or screen(s) containing the text of the
referenced claim(s).

Referring back to Fig. 9E, clicking on a patent number 916
links to a patent viewer screen in the form of screen 990 of Fig. 9K. As
shown therein, this screen has many of the same features as screens 970
and 980 (like reference numbers refer to like features). In addition,
screen 990 includes window 992 which may contain the full text of the
subject patent. Alternatively, in another embodiment, window 992 may
display an abbreviated disclosure including the patent title, assignee,
bibliographic information, abstract and full text of claims. Whether a
full text or abbreviated disclosure of the subject patent is provided when
clicking on a patent number may be determined by the type of dataset being
searched; not all datasets will necessarily contain full text documents.

In addition to the "claims" based results shown in Fig. 9E, a
concept query results screen 925 in Fig. 9F gives the results of a user's
"patents (all claims)" search as applied to the database of patents. In
the representative query depicted in Fig. 9F, the patents are listed in
order of decreasing relevance to the user's concept query (shown in block
930). Patents are ranked in numerical order and a patent number 926 is
given along with a title and an assignee 928. Next, the user may, by
clicking on a patent number 926, move to a screen showing the full text of
the patent (in the same form as Fig. 9K). In an alternative embodiment,
the user may, by clicking on patent number 926, move to a screen which
provides an "abbreviated" disclosure of the subject patent. This
abbreviated version may be in the form described above in connection with
Fig. 9K or, alternatively, in the form of screen 950 of Fig. 9G.
Specifically, window 951 of screen 950 provides an abbreviated section
describing the inventors, assignees, filing dates, categories and classes
of the subject patent. Also included is a table of U.S. references,
abstract, and the claims of the patent (not shown). As noted above,
whether a full text or abbreviated disclosure of the subject patent is
provided when clicking on a patent number may be determined by the type of
dataset being searched; not all datasets will necessarily contain full
text documents. (This is also true for patent and claim queries,
described below.) Window 951 also includes a View Image link 952 which,
when clicked, will launch a new Netscape window from a particular server
site (e.g., http://my_patent_site.com) containing images and will load a
scanned image of the subject patent into the window.

Screen 925 of Fig. 9F also includes a Modify Query link 927 
which may be clicked on to return to the original query. 

Referring again to Fig. 9F, the user may also click on a rank
number 932 and move to a patent viewer side-by-side screen 960 as depicted
in Fig. 9H. The patent viewer screen 960 of Fig. 9H enables the user to
have a side-by-side comparison of the concept query entered and the text
of various patents which match the concept query according to the system.
The full text of these documents is presented simultaneously on a computer
display, enabling the user to interactively explore a comparison of the
two documents. Alternatively, a subset of the text may be provided that
includes the abstract, claims and/or bibliographic information. The
format, as noted above, may be determined by the type of dataset being
searched. In addition, screen 960 includes highlighting controls 976-982
like those of Fig. 9I.

Figs. 10A, 10B, 10C and 10D depict representative screens in
accordance with the performing of a patent query as described hereinabove.
The patent query allows the user to draw comparisons between a single
patent and all other patents in the dataset. If the dataset is a single
dataset (i.e., not a split dataset), the patent query will compare the
selected patent to all of the patents in the selected dataset. If the
selected dataset is a split dataset (i.e., having at least two data
groups), the selected patent is compared just to the group of patents that
it is not in.
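
By way of illustration only, the choice of comparison targets for a single
dataset versus a split dataset might be sketched as follows. The function
name, the list-of-groups layout and the exclusion of the selected patent
from its own comparison set are assumptions made for this sketch; they are
not taken from the described system.

```python
def comparison_targets(selected, groups):
    """Return the patents that a selected patent is compared against.

    `groups` is a list of data groups, each a list of patent identifiers.
    A single dataset is modeled as one group; a split dataset as two or more.
    (This layout is an assumption of the sketch.)
    """
    if len(groups) == 1:
        # Single dataset: compare against every other patent in the dataset.
        return [p for p in groups[0] if p != selected]
    # Split dataset: compare only against groups the patent is not in.
    return [p for group in groups if selected not in group for p in group]


if __name__ == "__main__":
    print(comparison_targets("A", [["A", "B", "C"]]))
    print(comparison_targets("A", [["A", "B"], ["C", "D"]]))
```

The same selection rule applies to claim queries if the identifiers are
taken to be claims rather than patents.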

A patent query entry screen 1000 depicted in Fig. 10A enables
the user to enter the number of a patent contained in the database of
patents in block 1002. The system will analyze all members of the
database of patents against the patent entered. (However, when "Filter
Out Claims" selector 1005 is checked, the system will not compare claims
from the same patent.) Like the concept query, the patent query screen has
a search field 1004 which enables the user to select search processing for
"patents (all claims)" or "patents (best claim)." In "patents (best
claim)" processing, the system compares the patent to each individual
claim in the selected dataset (for a single dataset), or to each
individual claim in the data group not containing the selected patent (for
a split dataset), and returns a results list ranked by patent. The patent
score is the score of the highest ranked claim in the patent. The results
list displays patent information and has an option to view a listing of
all the ranked claim pairs for any patent in the results list.

In "patents (all claims)" processing, the system compares the patent
to all of the combined claims for each patent in the selected dataset (for
a single dataset), or to an amalgamation of claims for each patent in the
data group to which the selected patent does not belong (for a split
dataset), and returns a results list that ranks each matching patent based
on a score for all the claims in the patent. The results list displays
patent information and has an option to view a listing of all the ranked
claim pairs for any patent in the results list.
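
The difference between the two processing modes can be illustrated with a
minimal sketch. The similarity function below is a simple word-overlap
(Jaccard) measure chosen only to keep the example self-contained; it is a
stand-in, not the word-vector or semantic thread scoring of the described
system, and all names and sample texts are hypothetical.

```python
def jaccard(a, b):
    """Toy similarity: shared words divided by total distinct words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 0.0


def rank_best_claim(query, patents, sim=jaccard):
    """'patents (best claim)': a patent's score is its best claim's score."""
    scores = {num: max(sim(query, c) for c in claims)
              for num, claims in patents.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def rank_all_claims(query, patents, sim=jaccard):
    """'patents (all claims)': score the query against the combined claims."""
    scores = {num: sim(query, " ".join(claims))
              for num, claims in patents.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical patents, each mapped to the text of its claims.
    patents = {"P1": ["a widget with a gear", "a lever"], "P2": ["a gear"]}
    print(rank_best_claim("a gear", patents))
    print(rank_all_claims("a gear", patents))
```

In this sketch each patent is assumed to have at least one claim; an empty
claim list would need separate handling.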

Returning to Fig. 10A, clicking on Show Results icon 1003
displays the query patent number, title and assignee information at the
top of a results screen 1010, as shown in Fig. 10B.

The patent query results screen 1010 depicted in Fig. 10B
gives the results of the user's search as applied to the database of
patents. In the representative query depicted in Fig. 10B, the patents
are listed in order of decreasing relevance to the user's query. Patents
are ranked in numerical order and the patent number 1012 is given along
with the title and an assignee 1014. As shown in Fig. 10B, a patent query
generates two scores for each result: a Phrase Score 1018 and a Theme
Score 1020. Phrase Score 1018, generated from word-vector analysis,
measures similarities based upon words and phrases in claims. Theme Score
1020, generated from semantic thread analysis, measures similarities based
upon topical themes and concepts. The score used to sort is displayed in
bold.
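
As a rough illustration of a results list that carries two scores per
result and is ordered by whichever score the user selects, consider the
following sketch. The `Result` record, its field names and the sample
values are hypothetical and serve only to show the sorting step.

```python
from dataclasses import dataclass


@dataclass
class Result:
    patent_number: str   # hypothetical identifier
    phrase_score: float  # similarity based on words and phrases
    theme_score: float   # similarity based on themes and concepts


def rank_results(results, sort_by="phrase"):
    """Order results by decreasing phrase or theme score."""
    if sort_by not in ("phrase", "theme"):
        raise ValueError("sort_by must be 'phrase' or 'theme'")
    attr = "phrase_score" if sort_by == "phrase" else "theme_score"
    return sorted(results, key=lambda r: getattr(r, attr), reverse=True)


if __name__ == "__main__":
    rows = [Result("P1", 0.40, 0.90), Result("P2", 0.75, 0.20)]
    for rank, r in enumerate(rank_results(rows, "theme"), start=1):
        print(rank, r.patent_number, r.phrase_score, r.theme_score)
```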

Screen 1010 provides several navigational links between
screens. For example, by clicking on a patent number 1012 the user may
move to a screen which displays the entire text of the patent (in the same
form as shown in Fig. 9K). Alternatively, the user may click on a "view
claims" link 1016 to arrive at a claims comparison screen 1030 depicted in
Fig. 10C. Claims comparison screen 1030 permits the user to identify
matching claim pairs between the two patents at issue.

Referring to Fig. 10C, screen 1030 includes query patent
number 1032 and results patent number 1034. These patent numbers form
links to screens displaying the entire text of the patents (in the same
form as shown in Fig. 9K). The matching claim pairs for the two patents
are listed in rank order; e.g., rank 1 (claims 20, 2) and rank 2 (claims
21, 2). Corresponding claim numbers 1036 form links to screens displaying
the entire text of the claims (in the same form as shown in Fig. 9J).
Further, rank numbers 1038 form links to a side-by-side viewer screen in
the form of screen 1040 of Fig. 10D.

As shown in Fig. 10D, screen 1040 has many of the same
features as screen 990 of Fig. 9K (like reference numbers refer to like
features). Notably, screen 1040 includes windows 992 which may contain
the full text of the subject patents. As shown therein, the patent viewer
screen enables the user to have a side-by-side comparison of the two
patents. The full text of these documents presented simultaneously on a
computer display enables a user to interactively explore a comparison of
the two documents. Alternatively, in another embodiment, windows 992 may
display an abbreviated disclosure including the patent title, assignee,
bibliographic information, abstract and full text of claims. As noted
above, the type of patent information provided may be determined by the
type of dataset being searched.

Additionally, the user may click on a rank number 1013 of
Fig. 10B, which also links to the side-by-side viewer screen 1040 depicted
in Fig. 10D.

Figs. 11A, 11B and 11C depict representative screens in
accordance with the performing of a claim query as described hereinabove.
The claim query allows the user to draw comparisons between a single claim
and all other claims in the dataset. If the dataset is a single dataset,
the claim query will compare the selected claim to all of the claims in
the selected dataset. If the selected dataset is a split set (having two
data groups), the selected claim is compared just to the group of claims
that it is not in.

Claim query entry screen 1102 depicted in Fig. 11A enables the
user to enter the number of a patent and a claim contained in the database
via data entry blocks 1104 and 1106, respectively. The system will
analyze all members of the database against the claim entered. A user who
is unsure of the correct claim number to enter may, after entering the
patent number, click a "view claims" icon 1108, which will display the
full text of the claims as shown in Fig. 11B. Screen 1120 of Fig. 11B
displays the entire text of the claims for the patent corresponding to the
patent number entered. The user can scroll through the claims until the
desired claim is found.

Referring again to Fig. 11A, the claims query function, as
specified in block 1110, will compare a selected claim to each individual
claim in the selected dataset (for a single set) or to each individual
claim in the data group to which the selected claim does not belong (for a
split set). It returns a results list ranked by claim.

Once the user has entered the desired claim and selected a
"show results" icon 1112, the system responds with matching claims in
ranked order in screen 1130 of Fig. 11C. As shown in Fig. 11C, a claim
query generates two scores for each result: a Phrase Score 1132 and a
Theme Score 1134. These scores have the same meaning as Phrase Score 1018
and Theme Score 1020, respectively, which are described in connection with
Fig. 10B. Screen 1130 also provides query patent number 1136 and
resulting patent number 1137, along with corresponding claim numbers 1138
and 1139, respectively. These patent numbers form links to screens
displaying the entire text of the associated patents (in the same form as
shown in Fig. 9K). The corresponding claim numbers 1138, 1139 form links
to screens displaying the entire text of the claims (in the same form as
shown in Fig. 9J).

In addition, a user may click on a hyperlink rank indicator
1140 to perform a side-by-side comparison of the claims, resulting in a
side-by-side viewer screen in the form of screen 1150 of Fig. 11D. As
shown in Fig. 11D, screen 1150 has many of the same features as screen 980
of Fig. 9J (like reference numbers refer to like features). Each window
1152 and 1154 displays the title, assignee, patent number and full text of
the matching claims. As in Fig. 9J, if a subject claim refers to a
previous claim, in one embodiment the transitive closure of the claim
(i.e., all claims referenced either directly or indirectly) shall be shown
in the order referenced. Alternatively, in another embodiment the subject
claim shall embed in its text the claim number of the directly referenced
claim(s) in the form of a link to another screen or screens containing
the text of the referenced claim(s).
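
The transitive closure of a claim's references can be sketched as a simple
traversal over a table of direct references. The table layout (claim
number mapped to the claim numbers it cites directly) and the function
name are assumptions made for illustration; the breadth-first ordering is
one plausible reading of "in the order referenced".

```python
def claim_closure(claim, refs):
    """Return every claim referenced directly or indirectly by `claim`.

    `refs` maps a claim number to the list of claim numbers it references
    directly. Claims are returned in the order they are first reached.
    """
    ordered = []
    pending = list(refs.get(claim, []))
    while pending:
        current = pending.pop(0)
        if current == claim or current in ordered:
            continue  # skip self-references and claims already listed
        ordered.append(current)
        pending.extend(refs.get(current, []))
    return ordered


if __name__ == "__main__":
    # Hypothetical claim set: claim 3 depends on 2, which depends on 1.
    refs = {1: [], 2: [1], 3: [2], 5: [3, 4], 4: [1]}
    print(claim_closure(3, refs))
    print(claim_closure(5, refs))
```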

Alternatively, referring again to Fig. 11C, the user can click
on "view overlay plot" icon 1142 to view highlights of all the match
points of the results over the top of a cluster plot. Fig. 11F depicts
the steps in producing the overlay plot. First, as depicted by step 1102
of flow chart 1101, the basic cluster plot for an entire data set is
generated by offline processing as described hereinabove. Next, according
to step 1104, a claim query is run and score files, containing both term
and concept scores, are generated for each document in the data set
against the (online) claim query. Finally, in step 1106, for each
document i that matches the claim, the point (ST(i)+e, SC(i)+e) is plotted
on the cluster plot in a contrasting color to the original cluster plot,
wherein ST(i) equals the term score for document i, SC(i) equals the
concept score for document i, and e equals a random epsilon value for
spreading. The result is that the dots on the full cluster plot that
correspond to the claim query are highlighted.
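
The final step, computing the spread overlay coordinates, might be
sketched as follows. The function name, the dictionary inputs and the
`spread` parameter are assumptions of this sketch; in particular, the text
does not say whether the same epsilon is used for both coordinates, and
the sketch draws an independent value for each.

```python
import random


def overlay_points(term_scores, concept_scores, spread=0.01, seed=None):
    """Return (i, ST(i)+e, SC(i)+e) for each matching document i.

    `term_scores` and `concept_scores` map a document index to its term
    score ST(i) and concept score SC(i). `e` is a small random offset in
    [-spread, spread] so that coincident points do not hide one another.
    """
    rng = random.Random(seed)
    points = []
    for i in term_scores:
        x = term_scores[i] + rng.uniform(-spread, spread)
        y = concept_scores[i] + rng.uniform(-spread, spread)
        points.append((i, x, y))
    return points


if __name__ == "__main__":
    st = {1: 0.50, 2: 0.80}
    sc = {1: 0.25, 2: 0.75}
    for i, x, y in overlay_points(st, sc, seed=0):
        print(i, round(x, 3), round(y, 3))
```

The returned points would then be drawn over the existing cluster plot in
a contrasting color using whatever plotting facility is available.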

Figs. 12A and 12B depict representative screens in accordance
with the performing of a range query. The range query allows the user to
view claim pair matches in the dataset by specifying a score range. If
the selected dataset is a single dataset, every claim in the dataset is
compared to every other claim. If the set is a split set, every claim
from the first data group will be compared to every claim from the second
data group.

The range query entry screen depicted in Fig. 12A enables the
user to enter a start value and end value for a phrase score and a theme
score and then to select which score is to be used by the system in order
to rank results.

The system ranks the results in the range query as depicted in
Fig. 12B. The results are listed by patent number, title, assignee
information and the number of lines of each claim. By clicking on the
rank number, the user can view a side-by-side comparison of the two claims
in the viewer (in the same form as Fig. 11D). Otherwise, by clicking on
the patent number the user can view the full text of the patent in the
viewer (in the same form as Fig. 9K). Or, by clicking on the claim to
view, the user may view the full text of the claim in the viewer (in the
same form as Fig. 9J).

Fig. 12C depicts steps in producing a range query. First, as
shown in step 1202, the user views the cluster plot and decides on an area
of interest determined by a rectangle. Next, in step 1204, the user
enters the ranges for term scores and concept scores ST_min, ST_max,
SC_min, SC_max in accordance with the rectangular region of interest
determined in prior step 1202. The result is a result page showing only
the matches that have scores in the specified range corresponding to the
rectangle of the cluster plot.
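
The filtering implied by these steps can be sketched as a rectangle test on
each match's two scores. The tuple layout and the inclusive bounds are
assumptions made for this illustration.

```python
def range_query(matches, st_min, st_max, sc_min, sc_max):
    """Keep only matches whose scores fall inside the chosen rectangle.

    Each match is a tuple (pair, term_score, concept_score). Bounds are
    treated as inclusive, which is an assumption of this sketch.
    """
    return [
        m for m in matches
        if st_min <= m[1] <= st_max and sc_min <= m[2] <= sc_max
    ]


if __name__ == "__main__":
    # Hypothetical claim pairs with term and concept scores.
    matches = [
        (("c1", "c2"), 0.90, 0.20),
        (("c3", "c4"), 0.50, 0.60),
        (("c5", "c6"), 0.55, 0.95),
    ]
    print(range_query(matches, 0.4, 0.6, 0.5, 0.7))
```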

The automated highlighting in the user query screen enables
the highlighting of documents displayed side-by-side on the same display
where any occurrence of words or phrases from one or more predefined lists
is highlighted visually. Automated highlighting may also be used where
any occurrence of words or phrases specified by the user (or any words or
phrases automatically generated by one or more sets of rules specified by
the user) is highlighted visually.

In a related embodiment, the type of highlighting can be
varied to indicate the list to which the highlighted word or phrase
belongs, and the words or phrases can be highlighted only when they occur
in both documents.
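
A minimal sketch of highlighting only those listed terms that occur in
both documents is given below. Marking a term with square brackets stands
in for visual highlighting, and the function name, the whole-word matching
and the case-insensitive comparison are all assumptions of the sketch.

```python
import re


def highlight_common(doc_a, doc_b, terms):
    """Bracket each listed term that occurs in both documents."""
    def contains(doc, term):
        pattern = r"\b" + re.escape(term) + r"\b"
        return re.search(pattern, doc, flags=re.IGNORECASE) is not None

    shared = [t for t in terms if contains(doc_a, t) and contains(doc_b, t)]
    if not shared:
        return doc_a, doc_b
    # Longest terms first so a phrase wins over a word it contains.
    alternatives = "|".join(
        re.escape(t) for t in sorted(shared, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)

    def mark(doc):
        return pattern.sub(lambda m: "[" + m.group(0) + "]", doc)

    return mark(doc_a), mark(doc_b)


if __name__ == "__main__":
    a = "A gear drives the shaft."
    b = "The Gear is fixed; no lever."
    print(highlight_common(a, b, ["gear", "shaft", "lever"]))
```

Varying the marker per term list would correspond to varying the type of
highlighting to indicate the list to which a term belongs.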

An alternative embodiment of the present invention is provided
in Figs. 13A-13G. Included in this embodiment is a claim selection
operation as shown in screen 1310 of Fig. 13B. Specifically, as shown in
this figure, hyperlink claim numbers 1312 are provided for each claim
identified in this screen. A user may click on one of these claim numbers
to view the underlying claim (i.e., each claim number provides a link to a
screen displaying the text of the identified claim). Further, the
transitive closure of select claims is provided in screens 1320, 1330 and
1340 of Figs. 13C, 13F and 13G, respectively.

While the foregoing is a complete description of a specific
embodiment of the invention, various modifications, alternative
constructions and equivalents will be apparent to one skilled in the art.
Although aspects of the invention are described in terms of examples of
analyzing and visualizing patent texts, aspects of the invention are
applicable to other classes of documents. Therefore, it is not intended
that the invention be limited in any way except as defined by the claims.

WHAT IS CLAIMED IS:

1. A method of analyzing and displaying information
regarding a plurality of documents, the method comprising the steps of:
generating a set of N different representations of each
document, a given representation being designated the i-th representation
where i is an integer in the range of 1 to N inclusive;
for selected pairs of documents, determining N utility
measures, a given utility measure being designated the i-th utility
measure where i is an integer in the range of 1 to N inclusive, the i-th
utility measure being based on the respective i-th representations of the
documents in that pair; and
displaying a scatter plot in an area bounded by N non-parallel
axes, a given axis being designated the i-th axis where i is an integer in
the range of 1 to N inclusive, where each selected pair is represented by
a point in N-space having a coordinate along the i-th axis equal to the
i-th utility measure.

2. The method of claim 1 wherein the set of N different
representations comprises:
a first representation being a conceptual-level
representation; and
a second representation being a term-based representation.

3. The method of claim 2 wherein the utility measure is a
proximity score.

4. A method of analyzing and displaying information
regarding a plurality of documents, the method comprising the steps of:
generating first and second different representations of each
document;
for selected pairs of documents, determining (a) a first
utility measure based on the respective first representations of the
documents in that pair, and (b) a second utility measure based on the
respective second representations of the documents in that pair; and
displaying a scatter plot in an area bounded by first and
second non-parallel axes where each selected pair is represented by a
point having a first coordinate along the first axis equal to the first
utility measure and a second coordinate along the second axis equal to the
second utility measure.

5. The method of claim 4 wherein:
the first representation is a conceptual-level representation;
and
the second representation is a term-based representation.

6. The method of claim 4 wherein:
the first representation is a subject vector; and
the second representation is a word vector.

7. The method of claim 4 wherein each of the selected pairs
consists of a particular document in the plurality of documents and a
different respective one of the remaining documents in the plurality of
documents.

8. The method of claim 5 wherein the utility measure is a
proximity score.

9. The method of claim 4 wherein the documents are
publications.

10. The method of claim 4 wherein the documents are articles
from journals.

11. The method of claim 4 wherein the documents are
attributable to a product.

12. The method of claim 4 wherein the documents are
contained in a split dataset for making comparisons between collections of
documents.

13. The method of claim 4 wherein the documents are
different parts of patents.

14. The method of claim 13 wherein the different parts of
patents include claims.

15. The method of claim 13 wherein the different parts of
patents include a detailed description.

16. The method of claim 13 wherein the different parts of
patents include an abstract.

17. The method of claim 13 wherein the different parts of
patents include a summary.

18. The method of claim 13 wherein the different parts of
patents include a Background of Invention.

19. A method of analyzing information regarding a plurality
of documents, each having a unique document index, the method comprising
the steps of:
parsing each document into a plurality of elements;
generating a first representation of each of said elements;
and
for selected pairs of documents, comprised of a first document
and a second document, determining a first utility measure based on the
respective first representation of the plurality of elements for the
documents in that pair.

20. The method of claim 19, wherein said plurality of
elements are in a hierarchical relationship, further comprising the step
of:
displaying a representation of each of said plurality of
elements reflecting said hierarchical relationship.

21. The method of claim 19 wherein said elements comprise
patent claims.

22. The method of claim 20 wherein said representation is a
hypertext link.

23. The method of claim 20 wherein said representation is a
depiction of a sequence of said plurality of elements organized to reflect
said hierarchical relationship.

24. The method of claim 19, wherein said plurality of
elements are in a hierarchical relationship, further comprising the step
of:
selecting a particular element from said plurality of elements
as a basis for further analysis.

25. The method of claim 19 wherein the parsing step produces
a transitive closure of said plurality of elements.

26. The method of claim 19 wherein the elements are claims
and the parsing step comprises the steps of:
reading in text;
determining whether a new claim has begun;
tokenizing said text to extract a plurality of tokens;
adding said plurality of tokens to a word list for the claim;
and
scanning said tokenized text for tokens which indicate a
reference to a different claim.

27. The method of claim 19 further comprising the step of
displaying a plot in an area bounded by first and second non-parallel axes
where each selected pair is represented by a point having a first
coordinate along the first axis and a second coordinate along the second
axis.



28. The method of claim 27 further comprising the steps of:
generating a second representation of each of said elements;
for the selected pairs of documents, determining a second
utility measure based on the respective second representation of the
plurality of elements for the documents in that pair; and
wherein in the displaying step, the plot is a scatter plot,
the first coordinate is equal to the first utility measure and the second
coordinate is equal to the second utility measure.

29. The method of claim 19 further comprising the steps of:
generating a second representation of each of said elements; and
for the selected pairs of documents, determining a second
utility measure based on the respective second representation of the
plurality of elements for the documents in that pair.

30. The method of claim 27 wherein in the displaying step, the plot is a
2 dimensional visualization, the first coordinate is equal to the unique
document index of the first document of a pair of documents and the second
coordinate is equal to the unique document index of the second member of a
pair of documents, and an icon representing the first utility measure is
plotted for each pair of documents.

31. The method of claim 19 further comprising the step of
displaying a plot in an area bounded by first, second and third
non-parallel axes where each selected pair is represented by a point
having a first coordinate along the first axis, a second coordinate along
the second axis and a third coordinate along the third axis.

32. The method of claim 31 wherein in the displaying step, the plot is a
3 dimensional visualization, the first coordinate is equal to the unique
document index of the first document of a pair of documents and the second
coordinate is equal to the unique document index of the second member of a
pair of documents, and the third coordinate is equal to the first utility
measure, and an icon representing the first utility measure is plotted for
each pair of documents.

33. The method of claim 30 wherein said first utility measure 
is a combination of N utility measures. 

34. The method of claim 32 wherein said first utility measure 
is a combination of N utility measures. 

35. The method of claim 28, for an additional document,
further comprising:
parsing said additional document into a plurality of elements;
generating a first representation of each of said elements
from the parsing step;
for selected pairs of documents drawn such that a first member
of the pair is the additional document and a second member of the pair is
from said plurality of documents, determining a first utility measure
based on the respective first representation of the plurality of elements
for the documents in that pair;
generating a second representation of each of said elements
from the parsing step;
for selected pairs of documents drawn such that a first member
of the pair is the additional document and a second member of the pair is
from the plurality of documents, determining a second utility measure
based on the respective second representation of the plurality of elements
for the documents in that pair; and
wherein in the displaying step, the plot is a scatter plot,
generating an overlay plot in contrasting color to the scatter plot, the
first coordinate is equal to the first utility measure computed on the
pairs of documents including the additional document, and the second
coordinate is equal to the second utility measure computed on the pairs of
documents including the additional document.



36. The method of claim 35 wherein said additional document 
is a textual query entered by a user. 



37. The method of claim 35 wherein:
the first representation is a conceptual-level representation; and
the second representation is a term-based representation.

38. The method of claim 37 wherein: 

the first representation is a subject vector; and 
the second representation is a word vector. 



39. The method of claim 19 wherein said step of determining
a first utility measure further comprises the steps of:
determining a first intermediate utility measure;
determining a second intermediate utility measure; and
selecting a particular intermediate utility measure from said
first intermediate utility measure and said second intermediate utility
measure as said first utility measure.



40. The method of claim 29 wherein said step of determining
a second utility measure further comprises the following steps:
determining a third intermediate utility measure;
determining a fourth intermediate utility measure; and
selecting a particular intermediate utility measure from said
third intermediate utility measure and said fourth intermediate utility
measure as said second utility measure.

41. The method of claim 39 wherein: 

said first intermediate utility measure is a combination of a 
first similarity measure for said first document element and said first 
similarity measure for said second document element and a first 
normalization constant; and 

said second intermediate utility measure is a combination of a 
first similarity measure for said second document element and said first 
similarity measure for said first document element and a second 
normalization constant. 

42. The method of claim 40 wherein: 

(a) said third intermediate utility measure is a 
combination of a second similarity measure for said first document element 
and said second similarity measure for said second document element and a 
first normalization constant; and 

(b) said fourth intermediate utility measure is a 
combination of said second similarity measure for said second document 
element and said second similarity measure for said first document element 
and a second normalization constant. 

43. The method of claim 41 wherein said first similarity 
measure is a word weight vector. 

44. The method of claim 42 wherein said second similarity 
measure is an SFC weight vector. 

45. The method of claim 19 wherein:
said pairs of documents further comprise a first document and
a second document,
said first document is a dependent claim, x, depending from an
independent claim, X, and
said second document is a dependent claim, y, depending from
an independent claim, Y,
said determining a first utility measure further comprises the
following steps:
determining a first intermediate utility measure;
determining a second intermediate utility measure; and
combining said first intermediate utility measure and
said second intermediate utility measure.

46. The method of claim 29 wherein:
said pairs of documents further comprise a first document and
a second document,
said first document is a dependent claim, x, depending from an
independent claim, X, and
said second document is a dependent claim, y, depending from
an independent claim, Y,
said determining a second utility measure further comprises
the following steps:
determining a third intermediate utility measure;
determining a fourth intermediate utility measure; and
combining said third intermediate utility measure and
said fourth intermediate utility measure.

47. The method of claim 45 wherein:
(a) said first intermediate utility measure is a
combination of a first similarity measure for said first document element,
X, and said first similarity measure for said second document element Y;
and
(b) said second intermediate utility measure is a
combination of said first similarity measure for said first document
element, X, and said first similarity measure for said second document
element, Y.

48. The method of claim 46 wherein:
(a) said third intermediate utility measure is a
combination of said second similarity measure for said first document
element, x, and said second similarity measure for said second document
element Y; and
(b) said fourth intermediate utility measure is a
combination of said second similarity measure for said first document
element, X, and said second similarity measure for said second document
element, Y.

49. The method of claim 47 wherein said first similarity
measure is a word weight vector.

50. The method of claim 48 wherein said second similarity
measure is an SFC weight vector.

51. The method of claim 45 wherein said step of combining
comprises an averaging.

52. The method of claim 46 wherein said step of combining
comprises an averaging.

53. A computer program product which analyzes and displays
information regarding a plurality of documents comprising:
code for generating first and second representations of each
document;
code for determining, for selected pairs of documents:
(a) a first utility score based on the respective first
representations of the documents in that pair, and
(b) a second utility score based on the respective
second representations of the documents in that pair;
code for displaying a scatter plot in an area bounded by a
first and a second non-parallel axes where each selected pair is
represented by a point having a first coordinate along the first axis
equal to the first utility score and a second coordinate along the second
axis equal to the second utility score; and
a computer readable storage medium for storing the codes.

1 54. A method of analyzing patent documents comprising the 

2 steps of: 

3 providing a dataset containing a plurality of patent 

4 documents ; 

5 identifying within each patent document a portion of said 

6 document containing a set of claims; 

7 generating a first representation of each set of claims within 

8 said plurality of patent documents; and 

9 determining a first utility measure of at least one claim 

10 within at least one set of claims based upon similarity of said at least 

11 one claim with a query document. 

1 55. The method of claim 54 wherein said query document is a 

2 concept query, patent or claim. 

1 56. The method of claim 54 further comprising the step of 

2 displaying on a computer screen a ranking of a plurality of claims 

3 contained within said patent documents based upon said first utility 

4 measure associated with each of said plurality of claims, said screen 

5 including a claim number and rank number for each of said plurality of 

6 claims. 
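
As a non-limiting sketch, the ranking of claims against a query document described in claims 54 and 56 could look like the following. Plain term-frequency cosine similarity stands in for the utility measure, and the names `term_vector`, `cosine` and `rank_claims` are hypothetical; none of these choices is taken from the claims themselves.

```python
import math
from collections import Counter


def term_vector(text):
    """Assumed representation: whitespace-split, lower-cased term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_claims(claims, query):
    """Rank claims by similarity to the query.

    claims maps claim number -> claim text. Returns a list of
    (rank number, claim number, score) tuples, best match first;
    ties are broken by the lower claim number.
    """
    q = term_vector(query)
    scored = [(cosine(term_vector(text), q), number)
              for number, text in claims.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(rank, number, score)
            for rank, (score, number) in enumerate(scored, start=1)]
```

The rank number and claim number in each tuple correspond to the two columns that claim 56 places on the screen.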

1 57. The method of claim 56 further comprising the step of 

2 providing a link at said claim number to a full-text display of an 

3 associated claim. 

1 58. The method of claim 57 further comprising the step of 

2 providing a link at said rank number to a side-by-side textual display of 

3 said associated claim and said query document. 
1 59. The method of claim 54 further comprising the step of 

2 parsing each set of claims to identify each individual claim within said 

3 each set and all claims referenced by said each individual claim. 

1 60. The method of claim 54 further comprising the steps of: 

2 generating a second representation of each set of claims 

3 within said plurality of patent documents; and 

4 determining a second utility measure of said at least one 

5 claim within said at least one set of claims based upon similarity of said 

6 at least one claim with said query document. 

1 61. The method of claim 60 further comprising the step of 

2 displaying on a computer screen a ranking of said plurality of patent 

3 documents based upon said first and second utility measures associated 

4 with claims of each of said patent documents, said screen including a rank 

5 number for each of said plurality of patent documents. 

1 62. The method of claim 61 further comprising the step of 

2 providing a link at said rank number to a side-by-side textual display of 

3 an associated patent document and said query document. 

1 63. The method of claim 62 further comprising the step of 

2 providing a link at a screen icon to a textual display of a ranked listing 

3 of matching claims of said associated patent document and said query 

4 document. 

1 64. A method of analyzing a patent document comprising the 

2 steps of: 

3 providing a dataset containing at least one patent document; 

4 identifying within said at least one patent document a portion 

5 of said document containing a set of claims; 

6 parsing said set of claims to identify an individual claim 

7 within said set and all claims referenced by said individual claim; and 

8 displaying on a computer screen a link for each claim 

9 referenced by said individual claim. 

1 65. The method of claim 64 further comprising the step of 

2 displaying on said computer screen at least a portion of said individual 

3 claim. 

1 66. The method of claim 65 wherein, activation of said link 

2 for a particular claim referenced by said individual claim produces a full 

3 text display of said particular claim. 

1 67. The method of claim 66 wherein said link is a claim 

2 number. 




WO 98/16890    PCT/US97/18712 



1 68. The method of claim 67 wherein said full text display of 

2 said particular claim comprises a transitive closure of said particular 

3 claim. 
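
For illustration, the claim parsing of claims 59 and 64 and the transitive closure of claim 68 might be approximated as shown below. The regular expression is an assumption of this sketch: it recognizes only the singular form "claim N" and would miss forms such as "claims 1 or 2". The function names are hypothetical.

```python
import re

# Assumed pattern: the word "claim" followed by a number, e.g. "claim 45".
CLAIM_REF = re.compile(r"\bclaim\s+(\d+)", re.IGNORECASE)


def referenced_claims(text):
    """Return the sorted claim numbers directly referenced in a claim's text."""
    return sorted({int(n) for n in CLAIM_REF.findall(text)})


def transitive_closure(claims, number):
    """Return the claim itself plus every claim it depends on.

    claims maps claim number -> claim text. Dependencies are followed
    repeatedly, so a reference to claim 67 also pulls in whatever
    claim 67 references. Numbers not present in claims are skipped.
    """
    seen = set()
    stack = [number]
    while stack:
        current = stack.pop()
        if current in seen or current not in claims:
            continue
        seen.add(current)
        stack.extend(referenced_claims(claims[current]))
    return sorted(seen)
```

Applied to claim 68 above, such a routine would collect claims 64 through 68, whose texts could then be concatenated for the full text display.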
ABSTRACT: A gas turbine generator unit which is small-sized to be portable. The 
generator unit comprises a generator directly driven by a gas turbine. The rotatable 
shaft of the generator is coaxially connected with the rotor shaft of the gas turbine 
engine. An air filter box having an air filter element is disposed in a manner that 
the extension of the axis of the generator rotatable shaft passes through the air 
filter box. The air filter box is located on the opposite side of the gas turbine 
compressor with respect to the generator. Additionally, an intake silencer for reducing 
intake air noise is disposed to fluidly connect the air filter box and the air intake port 
of the compressor. The intake silencer is located to extend generally parallel with 
the axis of the generator rotatable shaft. 

CLAIMS (18): 

What is claimed is: 

■ 1. A gas turbine generator unit, comprising: 

• a gas turbine including a compressor and a rotor shaft; 

• a generator having a rotatable shaft which is coaxially connected with 
said gas turbine rotor shaft; 

• an air filter box having an air filter element and being arranged such 
that an extension of an axis of said generator rotatable shaft passes 
through said air filter box, said air filter box located on an opposite 
side of said generator with respect to said compressor; and 



Textual Analysis with the Viewer 

Figure 3.2: The MAPIT Viewer window showing text highlighted in two colors. 

FIG. 9H 























Netscape - [MAPIT Viewer] 

File  Edit  View  Go  Bookmarks  Options  Directory  Window  Help 

Back | Forward | Home | Reload | Image | Open | Print | Find | Stop 

"A" Phrases to Highlight (one per line)    Color "A": Blue 
"B" Phrases to Highlight (one per line)    Color "B": Red 

Update 

Words and Phrases 

apparatus for connecting a gas 
pressure source to several 
beer kegs in series 

3,758,008 

TAPPING ASSEMBLY FOR BEER KEGS AND THE LIKE 
(None) 

Use for new query 

1. A tapper assembly for beer kegs and the like, comprising in combination; an 
adapter for mounting in an opening in a keg or the like, and including a housing 
containing magnetic material; a liquid passageway in said adapter housing 
containing an inlet and an outlet with a first valve seat therebetween, the inlet 
being adapted to be in communication with the interior of a keg in which the 
adaptor may be mounted; a first valve member associated with said first valve 
seat and movable between an open position and a closed position relative 
thereto, said first valve member being biased toward the closed position by 
means including a permanent magnet; a gas passageway in said adapter 
housing containing an inlet and an outlet with a second valve seat therebetween 
the outlet being adapted to be in communication with the interior of a keg in 
which the adapter may be mounted; a second valve member associated with 
said second valve seat and movable between an open position and a closed 

Document Done 

Reference numerals: 970, 972, 974, 976, 978, 980, 984, 986 

FIG. 9I 

























Netscape - [MAPIT Viewer] 

File  Edit  View  Go  Bookmarks  Options  Directory  Window  Help 

Back | Forward | Home | Reload | Image | Open | Print | Find 








1. A tapper assembly for beer kegs and the like, comprising in combination; an adapter for mounting in an opening 
in a keg or the like, and including a housing containing magnetic material; a liquid passageway in said adapter 
housing containing an inlet and an outlet with a first valve seat therebetween, the inlet being adapted to be in 
communication with the interior of a keg in which the adaptor may be mounted; a first valve member associated with 
said first valve seat and movable between an open position and a closed position relative thereto, said first valve 
member being biased toward the closed position by means including a permanent magnet; a gas passageway in 
said adapter housing containing an inlet and an outlet with a second valve seat therebetween the outlet being 
adapted to be in communication with the interior of a keg in which the adapter may be mounted; a second valve 
member associated with said second valve seat and movable between an open position and a closed position 
relative thereto, said second valve member being biased toward the closed position by means including a 
permanent magnet; a tapper in selective sealing engagement with said adapter and including a housing; a gas 
passageway in said tapper housing having an inlet and an outlet with third valve means therebetween movable 
between an open position and a closed position, the inlet being adapted to be connected to a source of gas under 
high frequency content to modify a video display generated within an electronic viewfinder in order to 
indicate a properly focused video image, the improvement wherein said video apparatus comprises: 
means for generating a control signal that varies according to the high frequency content of the video 
signal as the video image is brought into focus; means for accumulating the control signal of the video 
signal generating the display in the viewfinder whereby the transition in brightness level in the 
viewfinder corresponds to the high frequency content of the video signal. 